
Page 1: Csc446: Pattern Recognition


Lecture Note 9:

Unsupervised Learning

(Chapter 10, Pattern Classification)

CSC446 : Pattern Recognition

Prof. Dr. Mostafa Gadal-Haqq

Faculty of Computer & Information Science

Computer Science Department

AIN SHAMS UNIVERSITY

Page 2: Csc446: Pattern Recognition

Unsupervised Learning & Clustering

10.1 Introduction

10.2 Mixture Densities Model

10.3 Maximum Likelihood Estimates

10.4 Application to Normal Mixtures

10.4.3 K-means algorithm

10.6 Data description and clustering

10.6.1 similarity measures

10.7 Criterion function for clustering

10.9 Hierarchical clustering


Page 3: Csc446: Pattern Recognition

Introduction

• So far, we have investigated “Supervised Learning”, in which training samples were labeled.

𝑫 = { 𝒙𝒊 , 𝒕𝒊 ; 𝒊 = 𝟏, … , 𝒏}

• We now investigate a number of “Unsupervised Learning” procedures, in which unlabeled training samples are used.

𝑫 = { 𝒙𝒊 ; 𝒊 = 𝟏, … , 𝒏}


Page 4: Csc446: Pattern Recognition

Introduction

Supervised learning vs. unsupervised learning:

• Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
  – These patterns are then utilized to predict the values of the target attribute in future data instances.

• Unsupervised learning: the data have no target attribute.
  – We want to explore the data to find some intrinsic structures in them.

Page 5: Csc446: Pattern Recognition

Introduction

• Motivations for unsupervised learning:

  1. Labeling a large amount of sample data can be very costly. For example: speech recognition.
  2. Training classifiers with large amounts of (less expensive) unlabeled samples, and only then using supervision to label the groupings found. For example: data mining.
  3. Unsupervised learning can be used to track slowly varying features over time, which improves performance.
  4. Unsupervised methods can be used to identify features that will then be useful for categorization.
  5. Unsupervised procedures can be used to gain some insight into the nature (or structure) of the data.

Page 6: Csc446: Pattern Recognition

Introduction

• Is it possible, in principle, to learn anything from unlabeled data?

• The answer to this question depends on the assumptions one is willing to accept!

• We shall begin with the assumption that the functional forms for the underlying probability densities are known and that the only thing that must be learned is the value of an unknown parameter vector.


Page 7: Csc446: Pattern Recognition

Mixture Densities Model

• We make the following assumptions:

  1. The samples come from a known number of classes c.
  2. The prior probabilities P(ω_j) for each class are known (j = 1, …, c).
  3. The forms of the class-conditional probability densities p(x | ω_j, θ_j) are known (j = 1, …, c).
  4. The values of the c parameter vectors θ_1, θ_2, …, θ_c are unknown.
  5. The category labels are unknown.

Page 8: Csc446: Pattern Recognition

Mixture Densities Model

• Thus, the probability density function for the samples is given by

  p(x | θ) = Σ_{j=1}^{c} p(x | ω_j, θ_j) P(ω_j)

  – where θ = (θ_1, …, θ_c)^t is the parameter vector,
  – the density function p(x | θ) is called the mixture density,
  – the density functions p(x | ω_j, θ_j) are called the component densities, and
  – the prior probabilities P(ω_j) are called the mixing parameters.
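As a concrete illustration, here is a minimal sketch (not from the lecture) that evaluates such a mixture density, assuming Gaussian component densities and made-up parameters:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, priors, means, covs):
    # p(x | theta) = sum_j p(x | omega_j, theta_j) * P(omega_j)
    return sum(P_j * multivariate_normal.pdf(x, mean=mu_j, cov=S_j)
               for P_j, mu_j, S_j in zip(priors, means, covs))

# hypothetical two-component mixture in two dimensions
priors = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), priors, means, covs))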

Page 9: Csc446: Pattern Recognition

Mixture Densities Model

• Our basic goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
• Once we know θ, we can decompose the mixture into its components.
• We can then use a maximum a posteriori classifier on the derived densities, if classification is our final goal.

Page 10: Csc446: Pattern Recognition

Maximum Likelihood Estimates

• Suppose that we have a set D = {x_1, …, x_n} of n unlabeled samples drawn independently from the mixture density

  p(x | θ) = Σ_{j=1}^{c} p(x | ω_j, θ_j) P(ω_j)

  where θ is fixed but unknown.

• For i.i.d. data, the likelihood of the observed samples is

  p(D | θ) = Π_{k=1}^{n} p(x_k | θ)

  and the log-likelihood is

  l(θ) = Σ_{k=1}^{n} ln p(x_k | θ)

• The gradient of the log-likelihood with respect to θ_i is

  ∇_{θ_i} l = Σ_{k=1}^{n} (1 / p(x_k | θ)) ∇_{θ_i} [ Σ_{j=1}^{c} p(x_k | ω_j, θ_j) P(ω_j) ]

Page 11: Csc446: Pattern Recognition

Maximum Likelihood Estimates

• We assume that the elements of θ_i and θ_j are functionally independent if i ≠ j, and we introduce the posterior probability

  P(ω_i | x_k, θ) = p(x_k | ω_i, θ_i) P(ω_i) / p(x_k | θ)

• Then we can write the gradient as

  ∇_{θ_i} l = Σ_{k=1}^{n} P(ω_i | x_k, θ) ∇_{θ_i} ln p(x_k | ω_i, θ_i)

• Then θ_i is estimated by maximizing l; the ML estimate θ̂_i must therefore satisfy

  Σ_{k=1}^{n} P(ω_i | x_k, θ̂) ∇_{θ_i} ln p(x_k | ω_i, θ̂_i) = 0        (1)
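For completeness, a short sketch of why the posterior appears in Eq. 1 (this derivation is not spelled out on the slides): only the j = i term of the mixture depends on θ_i, so

\[
\nabla_{\theta_i} l
= \sum_{k=1}^{n} \frac{P(\omega_i)\,\nabla_{\theta_i} p(x_k \mid \omega_i, \theta_i)}{p(x_k \mid \theta)}
= \sum_{k=1}^{n} \frac{p(x_k \mid \omega_i, \theta_i)\,P(\omega_i)}{p(x_k \mid \theta)}\,
  \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)
= \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\,\nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i),
\]

using ∇f = f ∇ln f and Bayes' rule; setting this gradient to zero at the maximum gives Eq. 1.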

Page 12: Csc446: Pattern Recognition

Maximum Likelihood Estimates

• We can generalize these results to include the prior probabilities P(ω_i) among the unknown quantities.
• The MLE can then be summarized as follows: find parameters P̂(ω_i) and θ̂_i such that

  P̂(ω_i) = (1/n) Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂)        (2)

  and

  Σ_{k=1}^{n} P̂(ω_i | x_k, θ̂) ∇_{θ_i} ln p(x_k | ω_i, θ̂_i) = 0        (1)

  where

  P̂(ω_i | x_k, θ̂) = p(x_k | ω_i, θ̂_i) P̂(ω_i) / Σ_{j=1}^{c} p(x_k | ω_j, θ̂_j) P̂(ω_j)        (3)

Page 13: Csc446: Pattern Recognition

Applications to Normal Mixtures

• The following table shows a few of the different cases that can arise, depending upon which parameters are unknown (?) and which are known (x):

  Case    μ_i    Σ_i    P(ω_i)    c
   1       ?      x      x        x
   2       ?      ?      ?        x
   3       ?      ?      ?        ?

Page 14: Csc446: Pattern Recognition

Applications to Normal Mixtures

Case 1: Unknown mean vectors (θ_i = μ_i)

• The log-likelihood of a component density is

  ln p(x | ω_i, μ_i) = −ln[(2π)^{d/2} |Σ_i|^{1/2}] − (1/2)(x − μ_i)^t Σ_i^{-1} (x − μ_i)

• The gradient of the log-likelihood is

  ∇_{μ_i} ln p(x | ω_i, μ_i) = Σ_i^{-1} (x − μ_i)

• Thus, the ML estimate of μ_i, according to Eq. 1, must satisfy

  μ̂_i = Σ_{k=1}^{n} P(ω_i | x_k, μ̂) x_k / Σ_{k=1}^{n} P(ω_i | x_k, μ̂)        (2)

Page 15: Csc446: Pattern Recognition

Applications to Normal Mixtures

• Unfortunately, Eq. 2 does not give μ̂_i explicitly.
• However, if we have some way of obtaining good initial estimates μ̂_i(0) for the unknown means, Eq. 2 can be viewed as an iterative process for improving the estimates:

  μ̂_i(j+1) = Σ_{k=1}^{n} P(ω_i | x_k, μ̂(j)) x_k / Σ_{k=1}^{n} P(ω_i | x_k, μ̂(j))

• This is a gradient ascent, or hill-climbing, procedure for maximizing the log-likelihood function.
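A minimal sketch of this iterative scheme in Python, assuming Gaussian components with known identity covariances and known priors (all names and data here are illustrative, not from the lecture):

import numpy as np
from scipy.stats import multivariate_normal

def update_means(X, means, priors, n_iters=50):
    # mu_i(j+1) = sum_k P(w_i | x_k, mu(j)) x_k / sum_k P(w_i | x_k, mu(j))
    means = [np.asarray(m, dtype=float) for m in means]
    for _ in range(n_iters):
        # posterior P(w_i | x_k, mu(j)) for every sample and every component
        dens = np.column_stack([P * multivariate_normal.pdf(X, mean=m)
                                for P, m in zip(priors, means)])
        post = dens / dens.sum(axis=1, keepdims=True)
        # weighted means: numerator sum_k post_ik x_k, denominator sum_k post_ik
        means = [(post[:, i:i + 1] * X).sum(axis=0) / post[:, i].sum()
                 for i in range(len(means))]
    return means

# two hypothetical Gaussian clusters in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
print(update_means(X, [np.zeros(2), np.ones(2)], [0.5, 0.5]))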

Page 16: Csc446: Pattern Recognition

The k-means clustering algorithm

• We are tempted to call this the c-means procedure, since its goal is to find the c mean vectors μ_1, μ_2, …, μ_c. However, it is now popularly known as k-means.
• It is clear that the posterior probability P̂(ω_i | x_k, μ̂) is large when the squared Mahalanobis distance (x_k − μ̂_i)^t Σ̂_i^{-1} (x_k − μ̂_i) is small.
• Suppose that we instead compute the squared Euclidean distance ||x_k − μ̂_i||^2, find the mean μ̂_m nearest to x_k, and approximate the posterior as

  P̂(ω_i | x_k, μ̂) = 1 if i = m, and 0 otherwise.

• We then use the iterative scheme to find μ̂_1, μ̂_2, …, μ̂_c.

Page 17: Csc446: Pattern Recognition

The k-means clustering algorithm

• Denoting the known number of patterns as n and the desired number of clusters as c, the k-means algorithm is:

  Begin
    initialize n, c, μ_1, μ_2, …, μ_c
    do    classify the n samples according to the nearest μ_i
          recompute μ_i
    until no change in μ_i
    return μ_1, μ_2, …, μ_c
  End
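A compact Python sketch of this batch procedure (the random initialization and the names are illustrative choices, not prescribed by the lecture):

import numpy as np

def k_means(X, c, n_iters=100, seed=0):
    # Begin: initialize the c means with randomly chosen samples
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iters):
        # classify the n samples according to the nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each mean; keep the old one if a cluster becomes empty
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(c)])
        if np.allclose(new_means, means):   # until no change in the means
            break
        means = new_means
    return means, labels

means, labels = k_means(np.vstack([np.random.randn(50, 2),
                                   np.random.randn(50, 2) + 5.0]), c=2)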

Page 18: Csc446: Pattern Recognition

Data Description and Clustering

• If the goal is to find subclasses, a more direct alternative is to use a clustering procedure, which produces a data description in terms of clusters, or groups of data points that possess strong internal similarities.
• Formal clustering procedures use a criterion function, such as the sum of the squared distances from the cluster centers, and seek the grouping that extremizes the criterion function.

Page 19: Csc446: Pattern Recognition

Data Description and Clustering

• To define what we mean by a natural grouping, we need to answer the following two questions:
  1. How should we measure the similarity between samples?
  2. How should we evaluate a partitioning of a set of samples into clusters?
• The most obvious measure of similarity (or dissimilarity) between two samples is the metric distance between them.
• Once such a distance is defined, one would expect the distance between samples of the same cluster to be significantly smaller than the distance between samples in different clusters.

Page 20: Csc446: Pattern Recognition

Similarity Measures: Metric Distance

• A metric D(·, ·) is a function that gives a generalized scalar distance between its two arguments (patterns).
• A metric must have four properties:
  – non-negativity: D(a, b) ≥ 0
  – reflexivity: D(a, b) = 0 if and only if a = b
  – symmetry: D(a, b) = D(b, a)
  – triangle inequality: D(a, b) + D(b, c) ≥ D(a, c)

Page 21: Csc446: Pattern Recognition

Similarity Measures: Metric Distance

• The most popular metric functions (in d dimensions) are:
  – the Manhattan or city block distance:

    C(a, b) = Σ_{i=1}^{d} |a_i − b_i|

  – the Euclidean distance:

    D(a, b) = (Σ_{k=1}^{d} (a_k − b_k)^2)^{1/2}

  – the Minkowski distance, or L_k norm:

    L_k(a, b) = (Σ_{i=1}^{d} |a_i − b_i|^k)^{1/k}

• Note that the L_1 and L_2 norms give the Manhattan and Euclidean metrics, respectively.
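A small sketch of these three metrics in Python (the function names are illustrative):

import numpy as np

def manhattan(a, b):
    # L_1 norm: sum_i |a_i - b_i|
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    # L_2 norm: (sum_k (a_k - b_k)^2)^(1/2)
    return np.sqrt(np.sum((a - b) ** 2))

def minkowski(a, b, k):
    # L_k norm: (sum_i |a_i - b_i|^k)^(1/k)
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(manhattan(a, b), euclidean(a, b), minkowski(a, b, 2))  # minkowski with k=2 equals euclidean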

Page 22: Csc446: Pattern Recognition

• The Minkowski distance for different values of k:

Similarity Measures: Metric Distance


Page 23: Csc446: Pattern Recognition

Similarity Measures

• To achieve invariance, one can normalize the data, e.g., so that they all have zero mean and unit variance, or use principal components for invariance to rotation.
• One can also use a nonmetric similarity function s(x, x') to compare two vectors, for example the normalized inner product

  s(x, x') = x^t x' / (||x|| ||x'||)
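A one-function sketch of this similarity measure:

import numpy as np

def cosine_similarity(x, xp):
    # s(x, x') = x^t x' / (||x|| ||x'||)
    return float(np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # about 0.707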

Page 24: Csc446: Pattern Recognition

Similarity Measures

• The Euclidean distance is a possible metric: a possible criterion is to assign samples to the same cluster if their distance is less than a threshold value d_0.
• Clusters defined by the Euclidean distance are invariant to translations and rotations of the feature space, but not to general transformations that distort the distance relationships.

Page 25: Csc446: Pattern Recognition

Criterion Functions for Clustering

• The second issue: how should we evaluate a partitioning of a set into clusters?
• Clustering can be posed as the optimization of a criterion function.
• Scatter criteria: the sum-of-squared-error criterion

  J_e = Σ_{i=1}^{c} Σ_{x ∈ D_i} ||x − m_i||^2

  where n_i is the number of samples in D_i and m_i is the mean of those samples,

  m_i = (1/n_i) Σ_{x ∈ D_i} x
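A short sketch that computes J_e for a given labeling (a hypothetical helper, not from the slides):

import numpy as np

def sum_squared_error(X, labels):
    # J_e = sum_i sum_{x in D_i} ||x - m_i||^2, with m_i the mean of cluster D_i
    Je = 0.0
    for i in np.unique(labels):
        cluster = X[labels == i]
        m_i = cluster.mean(axis=0)
        Je += np.sum(np.linalg.norm(cluster - m_i, axis=1) ** 2)
    return Je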

Page 26: Csc446: Pattern Recognition

Criterion Functions for Clustering

• This criterion defines clusters by their mean vectors m_i, in the sense that it minimizes the sum of the squared lengths of the errors x − m_i.
• The optimal partition is defined as the one that minimizes J_e; it is also called the minimum-variance partition.
• It works fine when the clusters form well-separated, compact clouds, but less so when there are great differences in the number of samples in different clusters.

Page 27: Csc446: Pattern Recognition

Criterion Functions for Clustering

• Scatter criteria:
  – Scatter matrices are used as in multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B, with total scatter

    S_T = S_B + S_W

    which depends only on the set of samples (not on the partitioning).
  – The criterion can be to minimize the within-cluster scatter or to maximize the between-cluster scatter.
  – The trace (the sum of the diagonal elements) is the simplest scalar measure of a scatter matrix, as it is proportional to the sum of the variances in the coordinate directions.

Page 28: Csc446: Pattern Recognition

Criterion Functions for Clustering

• The trace of the within-cluster scatter matrix is

  tr[S_W] = Σ_{i=1}^{c} tr[S_i] = Σ_{i=1}^{c} Σ_{x ∈ D_i} ||x − m_i||^2 = J_e

  which is, in practice, the sum-of-squared-error criterion.
• As tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results are obtained by trying to minimize tr[S_B].
• However, seeking to minimize the within-cluster criterion J_e = tr[S_W] is equivalent to maximizing the between-cluster criterion

  tr[S_B] = Σ_{i=1}^{c} n_i ||m_i − m||^2

  where m is the total mean vector,

  m = (1/n) Σ_x x = (1/n) Σ_{i=1}^{c} n_i m_i
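A small numerical check of these relations, with made-up data and labels (a sketch, not part of the lecture):

import numpy as np

def scatter_matrices(X, labels):
    # S_W = sum_i sum_{x in D_i} (x - m_i)(x - m_i)^t
    # S_B = sum_i n_i (m_i - m)(m_i - m)^t
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for i in np.unique(labels):
        cluster = X[labels == i]
        m_i = cluster.mean(axis=0)
        diffs = cluster - m_i
        S_W += diffs.T @ diffs
        S_B += len(cluster) * np.outer(m_i - m, m_i - m)
    return S_W, S_B

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
labels = rng.integers(0, 3, size=60)
S_W, S_B = scatter_matrices(X, labels)
S_T = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))
print(np.allclose(np.trace(S_T), np.trace(S_W) + np.trace(S_B)))  # True: tr[S_T] = tr[S_W] + tr[S_B]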

Page 29: Csc446: Pattern Recognition

Iterative optimization

• Once a criterion function has been selected, clustering becomes a problem of discrete optimization.
• As the sample set is finite, there is a finite number of possible partitions, and the optimal one can always be found by exhaustive search.
• Most frequently, an iterative optimization procedure is adopted to select the optimal partition.
• The basic idea is to start from a reasonable initial partition and "move" samples from one cluster to another, trying to minimize the criterion function.
• In general, these kinds of approaches guarantee local, NOT global, optimization.

Page 30: Csc446: Pattern Recognition

Iterative optimization

• Let us consider an iterative procedure to minimize the sum-of-squared-error criterion J_e,

  J_e = Σ_{i=1}^{c} J_i,   where   J_i = Σ_{x ∈ D_i} ||x − m_i||^2

  and J_i is the effective error per cluster.

Page 31: Csc446: Pattern Recognition

Iterative optimization


Page 32: Csc446: Pattern Recognition

Iterative optimization

• This procedure is a sequential version of the k-means algorithm, with the difference that k-means waits until all n samples have been reclassified before updating, whereas the latter updates the means each time a sample is reclassified.
• This procedure is more prone to being trapped in local minima, and it depends on the order of presentation of the samples.
• The starting point is always a problem. Possible choices:
  • random cluster centers;
  • repetition with different random initializations;
  • the c-cluster starting point taken as the solution of the (c−1)-cluster problem plus the sample farthest from the nearest cluster center.
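A sketch of such a sequential update (an online k-means step; the per-cluster counters used for the incremental mean update are an implementation choice, not something specified on the slides):

import numpy as np

def sequential_k_means(X, init_means, n_passes=5):
    # update the nearest mean each time a single sample is presented
    means = np.array(init_means, dtype=float)
    counts = np.ones(len(means))                 # one virtual sample per initial mean
    for _ in range(n_passes):
        for x in X:                              # note: the result depends on presentation order
            i = np.argmin(np.linalg.norm(means - x, axis=1))
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]   # incremental mean update
    return means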

Page 33: Csc446: Pattern Recognition

Hierarchical Clustering

• Many times, clusters are not disjoint: a cluster may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters:
  • The first is a partition into n clusters, each containing exactly one sample.
  • The second is a partition into n−1 clusters, the third into n−2, and so on, until the n-th, in which there is only one cluster containing all of the samples.
  • At level k in the sequence, c = n − k + 1.

Page 34: Csc446: Pattern Recognition

Hierarchical Clustering

• Given any two samples x and x', they will be grouped together at some level, and once they are grouped at level k they remain grouped for all higher levels.
• Hierarchical clustering has a tree representation called a dendrogram.

Page 35: Csc446: Pattern Recognition

Hierarchical Clustering

• The similarity values may help to determine whether the groupings are natural or forced, but if they are evenly distributed no such information can be gained.
• Another representation is based on sets, e.g., on Venn diagrams.

Page 36: Csc446: Pattern Recognition

Hierarchical Clustering

• Hierarchical clustering can be divided into agglomerative and divisive approaches.
• Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters.
• Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.

Page 37: Csc446: Pattern Recognition

Agglomerative hierarchical clustering

• The procedure terminates when the specified number of clusters has been obtained, and it returns the clusters as sets of points, rather than as a mean or a representative vector for each cluster.

Page 38: Csc446: Pattern Recognition

Agglomerative hierarchical clustering

• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level.
• To find the nearest clusters, one can use

  d_min(D_i, D_j) = min_{x ∈ D_i, x' ∈ D_j} ||x − x'||
  d_max(D_i, D_j) = max_{x ∈ D_i, x' ∈ D_j} ||x − x'||
  d_avg(D_i, D_j) = (1 / (n_i n_j)) Σ_{x ∈ D_i} Σ_{x' ∈ D_j} ||x − x'||
  d_mean(D_i, D_j) = ||m_i − m_j||

  which behave quite similarly if the clusters are hyperspherical and well separated.
• The computational complexity is O(c n^2 d^2), with n >> c.
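The four cluster-to-cluster distances above can be sketched directly in Python (names are illustrative):

import numpy as np

def pairwise(Di, Dj):
    # all distances ||x - x'|| for x in D_i, x' in D_j
    return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)

def d_min(Di, Dj):  return pairwise(Di, Dj).min()
def d_max(Di, Dj):  return pairwise(Di, Dj).max()
def d_avg(Di, Dj):  return pairwise(Di, Dj).mean()
def d_mean(Di, Dj): return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

Di = np.array([[0.0, 0.0], [1.0, 0.0]])
Dj = np.array([[4.0, 0.0], [5.0, 0.0]])
print(d_min(Di, Dj), d_max(Di, Dj), d_avg(Di, Dj), d_mean(Di, Dj))  # 3.0 5.0 4.0 4.0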

Page 39: Csc446: Pattern Recognition

Agglomerative hierarchical clustering

Nearest-neighbor algorithm

• When d_min is used, the algorithm is called the nearest-neighbor algorithm.
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm.
• If the data points are thought of as nodes of a graph, with edges forming a path between the nodes in the same subset D_i, then merging D_i and D_j corresponds to adding an edge between the nearest pair of nodes in D_i and D_j.
• The resulting graph has no closed loops, so it is a tree; if all subsets are linked, we have a spanning tree.

Page 40: Csc446: Pattern Recognition

Agglomerative hierarchical clustering

• The use of d_min as the distance measure in agglomerative clustering generates a minimal spanning tree.
• Chaining effect: a defect of this distance measure, in which well-separated groups can end up merged through a chain of intermediate points.

Page 41: Csc446: Pattern Recognition

Agglomerative hierarchical clustering

The farthest-neighbor algorithm:

• When d_max is used, the algorithm is called the farthest-neighbor algorithm.
• If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm.
• This method discourages the growth of elongated clusters.
• In the terminology of graph theory, every cluster constitutes a complete subgraph, and the distance between two clusters is determined by the most distant nodes in the two clusters.

Page 42: Csc446: Pattern Recognition

Agglomerative hierarchical clustering

• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters.
• All procedures involving minima or maxima are sensitive to outliers; the use of d_mean or d_avg is a natural compromise.

Page 43: Csc446: Pattern Recognition

The problem of the number of clusters

• Typically, the number of clusters is known.
• When it is not, there are several ways to proceed:
  • When clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc.
  • Another approach is to set a threshold for the creation of a new cluster; this suits online cases, but it depends on the order of presentation of the data.
• These approaches are similar to model-selection procedures, typically used to determine the topology and the number of states (e.g., clusters, parameters) of a model, given a specific application.

Page 44: Csc446: Pattern Recognition

Graph-theoretic methods

• Graph theory permits us to consider particular structures of the data.
• The procedure of setting a distance threshold for placing two points in the same cluster can be generalized to arbitrary similarity measures.
• If s_0 is a threshold value, we can say that x_i is similar to x_j if s(x_i, x_j) > s_0.
• Hence, we define a similarity matrix S = [s_ij] with

  s_ij = 1 if s(x_i, x_j) > s_0, and 0 otherwise.
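A short sketch of building this thresholded similarity matrix (the choice of similarity function and threshold below is hypothetical):

import numpy as np

def similarity_matrix(X, s, s0):
    # S = [s_ij], with s_ij = 1 if s(x_i, x_j) > s_0 and 0 otherwise
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            S[i, j] = 1 if s(X[i], X[j]) > s0 else 0
    return S

# hypothetical similarity: negative Euclidean distance, threshold -1.0
X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
print(similarity_matrix(X, lambda a, b: -np.linalg.norm(a - b), -1.0))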

Page 45: Csc446: Pattern Recognition

Graph-theoretic methods

• This matrix induces a similarity graph, dual to S, in which nodes correspond to points and an edge joins nodes i and j iff s_ij = 1.
• Single-linkage algorithm: two samples x and x' are in the same cluster if there exists a chain x, x_1, x_2, …, x_k, x' such that x is similar to x_1, x_1 to x_2, and so on; the clusters are the connected components of the graph.
• Complete-linkage algorithm: all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
• The nearest-neighbor algorithm is a method to find the minimum spanning tree, and vice versa.
  – Removal of the longest edge produces a 2-cluster grouping, removal of the next-longest edge produces a 3-cluster grouping, and so on.
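A sketch of this minimum-spanning-tree view of clustering using scipy (the helper name and the data are illustrative): build the MST of the complete Euclidean graph, remove the c−1 longest tree edges, and return the connected components as clusters.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(X, c):
    # minimum spanning tree of the complete graph with Euclidean edge weights
    mst = minimum_spanning_tree(squareform(pdist(X))).toarray()
    if c > 1:
        edges = np.argwhere(mst > 0)
        weights = mst[edges[:, 0], edges[:, 1]]
        # removing the c-1 longest tree edges leaves c connected components
        for i, j in edges[np.argsort(weights)[-(c - 1):]]:
            mst[i, j] = 0.0
    _, labels = connected_components(mst, directed=False)
    return labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(8.0, 1.0, (20, 2))])
print(mst_clusters(X, 2))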

Page 46: Csc446: Pattern Recognition

Graph-theoretic methods

• This is a divisive hierarchical procedure, and it suggests ways of dividing the graph into subgraphs.
  – E.g., in selecting an edge to remove, compare its length with the lengths of the other edges incident upon its nodes.

Page 47: Csc446: Pattern Recognition

Graph-theoretic methods

• One useful statistic that can be estimated from the minimal spanning tree is the edge-length distribution.
• For instance, in the case of two dense clusters immersed in a sparse set of points, the edges within the dense clusters are much shorter than those crossing the sparse region, and this shows up clearly in the edge-length distribution.

Page 48: Csc446: Pattern Recognition

What’s Next?

Pattern Recognition Application:

Speech Recognition
