Detecting multivariate outliers using projection pursuit with particle swarm optimization

Anne Ruiz-Gazen, Alain Berro, Souad Larabi Marie-Sainte

University of Toulouse 1 - Capitole (TSE - IRIT - IMT)

COMPSTAT, Paris, September 2010

Introduction

What is Exploratory Projection Pursuit?

search for “interesting” linear low-dimensional projections of high-dimensional multivariate data

Interesting structures:

outliers, clusters, . . .

Two ingredients:

projection interestingness: projection index I
optimization of the index: algorithm

EPP is usually known by statisticians but not used!

Well-known statistical software packages do NOT offer PP procedures (only some routines in Fortran, Splus, Matlab and GGobi).

Recent applications in anomaly detection in hyperspectral imagery (Achard et al., 2004, Malpica et al., 2008, Smetek and Bauer, 2008).

Mathematically

denote by $X$ the $n \times p$ data matrix and by $X_i$ the $p \times 1$ vector of observation $i$ (continuous variables),

data are centered and scaled (divided by the standard deviation, or made spherical),

consider one-dimensional projections from $\mathbb{R}^p$ to $\mathbb{R}$: $z = X\alpha$,

where $\alpha$ is a $p$-dimensional projection vector with $\alpha'\alpha = 1$ and $z$ is an $n$-dimensional vector containing the coordinates of the projected observations,

define a projection index function $I : \alpha \mapsto I(\alpha)$,

find projection vectors $\alpha$ solving $\max_{\{\alpha \in \mathbb{R}^p \mid \alpha'\alpha = 1\}} I(\alpha)$.
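The setup above can be sketched in a few lines of Python with NumPy. This is only an illustration: the function names `standardize` and `project`, the random data and the chosen direction are assumptions made for the example, and scaling by the standard deviation is just one of the two options mentioned (sphering is not shown).

```python
import numpy as np

def standardize(X):
    """Center each column and divide it by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def project(X, alpha):
    """Return z = X alpha after rescaling alpha so that alpha' alpha = 1."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)
    return X @ alpha

# Hypothetical data: n = 50 observations, p = 4 variables.
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(50, 4)))
z = project(X, [1.0, 1.0, 0.0, 0.0])  # n-dimensional vector of projected coordinates
```

Any projection index $I$ can then be written as a function of `X` and `alpha` that works on `z`.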

Contents

1 PP indices for multivariate outliers detection: a review
    First proposals
    Other proposals

2 Optimization procedures for PP: a review
    Strategy for the first proposals
    Other strategies

3 New optimization proposals
    A new strategy
    Heuristics optimization algorithms
        Genetic Algorithm
        Particle Swarm Optimization algorithm
        Tribes

4 Illustration with EPP-Lab

5 Conclusion and perspectives

PP indices for multivariate outliers detection: a review

First proposals

Definition of an “interesting” projection discussed in the founding papers on PP (Friedman and Tukey, 1974, Huber, 1985, Jones and Sibson, 1987, and Friedman, 1987).

Several arguments: “gaussianity is uninteresting”.

Any measure of departure from normality = a PP index.

Objective more general than looking for projections that reveal outlying observations. However, several indices are very sensitive to departure from normality in the tails of the distribution and reveal outliers first.

Friedman-Tukey (1974):

$$I_{FT}(\alpha) = \frac{1}{n^2 h^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K\left(\frac{\alpha'(X_i - X_j)}{h}\right)$$

with $K(u) = \frac{35}{32}(1 - u^2)^3 \, \mathbf{1}_{\{|u| \le 1\}}$ and $h = 3.12\, N^{-1/6}$ (Klinke, 1997).

Friedman (1987): index based on the $L^2$ distance between the projected data distribution and the Gaussian distribution (using expansions based on Legendre polynomials).

Kurtosis:

$$I_{kurt}(\alpha) = \sum_{i=1}^{n} (\alpha' X_i)^4$$

(Huber, 1985, Pena and Prieto, 2001)
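As a rough illustration, the Friedman-Tukey and kurtosis indices can be coded directly from the formulas on this slide. The function names are hypothetical, the normalization $1/(n^2 h^2)$ is taken as written above, and the code assumes that $N$ in the bandwidth formula is the sample size $n$. The pairwise computation needs $O(n^2)$ memory, so this is a sketch for small data sets only.

```python
import numpy as np

def kurtosis_index(X, alpha):
    """I_kurt(alpha): sum of the fourth powers of the projected observations."""
    z = X @ alpha
    return float(np.sum(z ** 4))

def friedman_tukey_index(X, alpha, h=None):
    """I_FT(alpha) with the kernel K(u) = 35/32 (1 - u^2)^3 on |u| <= 1."""
    n = X.shape[0]
    if h is None:
        # Assumption: N in h = 3.12 N^(-1/6) is the number of observations n.
        h = 3.12 * n ** (-1.0 / 6.0)
    z = X @ alpha
    u = (z[:, None] - z[None, :]) / h          # all pairwise scaled differences
    K = np.where(np.abs(u) <= 1.0, (35.0 / 32.0) * (1.0 - u ** 2) ** 3, 0.0)
    return float(K.sum() / (n ** 2 * h ** 2))  # normalization as written on the slide

# Tiny hand-checkable cases.
k = kurtosis_index(np.eye(2), np.array([1.0, 0.0]))
ft = friedman_tukey_index(np.zeros((2, 2)), np.array([1.0, 0.0]), h=1.0)
```

In the second case all projected points coincide, so every kernel term equals $K(0) = 35/32$.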

Other indices

Measure of outlyingness (Stahel-Donoho): for each observation $i = 1, \ldots, n$,

$$I_i(\alpha) = \frac{|\alpha' X_i - \mathrm{med}_j(\alpha' X_j)|}{\mathrm{mad}_j(\alpha' X_j)}$$

where “med” = median, “mad” = median absolute deviation of the projected data from the median.

Dispersion-based indices: robust dispersion estimator (Li and Chen, 1985, Croux and Ruiz-Gazen, 2005), defines a robust principal component analysis.

Juan and Prieto (2001) for concentrated contamination patterns

Hall and Kay (2005): nonparametric atypicality index

Indices adapted to time series, . . .

Many complementary definitions of indices but . . .

the main problem with PP: the pursuit is computationally intensive.
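The Stahel-Donoho measure defined on this slide, for a fixed direction, can be sketched as follows. The function name `outlyingness` and the toy data are assumptions made for the example; the sketch does not guard against a zero mad, which happens when at least half of the projected values coincide.

```python
import numpy as np

def outlyingness(X, alpha):
    """I_i(alpha) for every observation i: |z_i - med(z)| / mad(z), with z = X alpha."""
    z = X @ alpha
    med = np.median(z)
    mad = np.median(np.abs(z - med))  # unscaled median absolute deviation
    return np.abs(z - med) / mad

# One variable, five observations, the last one far from the others.
X = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
scores = outlyingness(X, np.array([1.0]))
```

Here the median is 2 and the mad is 1, so the last observation gets a score of 98 while the others stay at 2 or below.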

Optimization procedures for PP: a review

Strategy for the first proposals

A complex structure usually needs several one-dimensional projections to be revealed ⇒ several interesting optima of the projection index.

The usual strategy

Global optimization algorithm ⇒ one optimum projection.

Remove the structure found from the data set.

Iterate the procedure.

Example: For the kurtosis index, Pena and Prieto (2001) use a modified version of Newton’s optimization method and remove the structure by projecting the data onto the space orthogonal to the projection found. The procedure is iterated p times (the number of dimensions).
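The structure-removal step of this example, projecting the data onto the space orthogonal to a direction already found, can be sketched as below. Only this step is shown: the modified Newton optimizer is not reproduced, and the function name `remove_direction`, the random data and the direction are assumptions made for the illustration.

```python
import numpy as np

def remove_direction(X, alpha):
    """Project every row of X onto the hyperplane orthogonal to alpha."""
    alpha = alpha / np.linalg.norm(alpha)
    return X - np.outer(X @ alpha, alpha)  # same as X (I - alpha alpha')

# Hypothetical data and a direction assumed to come from a previous pursuit.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
alpha = np.array([1.0, 2.0, 2.0]) / 3.0    # unit norm
X_deflated = remove_direction(X, alpha)
```

After this step the data have no component left along `alpha`, so a new pursuit on `X_deflated` cannot return the same projection.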

Limitations:

Global optimization based on repeated local optimization is usually quite costly.

Local optimization algorithms rely on regularity conditions on the projection index.

Structure removal may miss some interesting projections (Huber, 1985, Friedman, 1987).

Other strategies

Finite number of projection vectors

replace the maximization problem $\max_{\{\alpha \in \mathbb{R}^p \mid \alpha'\alpha = 1\}} I(\alpha)$ by

$$\max_{\{\alpha \in A \mid \alpha'\alpha = 1\}} I(\alpha)$$

where $A$ contains a finite number of directions, and

calculate $I(\alpha)$ for all $\alpha \in A$.

Limitations: This strategy may miss interesting projections by not exploring the space of solutions enough.
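A minimal sketch of the finite-set strategy is given below. The slide does not say how the set $A$ is built, so drawing random unit vectors is an assumption of this example, as are the function names and the use of the kurtosis index.

```python
import numpy as np

def random_unit_directions(p, m, rng):
    """Draw m random directions in R^p, rescaled to unit norm (one possible choice of A)."""
    A = rng.normal(size=(m, p))
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def best_of_finite_set(X, index, directions):
    """Evaluate the index on every direction of A and return the best one with its value."""
    values = np.array([index(X, a) for a in directions])
    k = int(np.argmax(values))
    return directions[k], float(values[k])

def kurtosis_index(X, alpha):
    return float(np.sum((X @ alpha) ** 4))

# Hand-checkable case: A is the canonical basis of R^2.
X = np.array([[2.0, 0.0], [0.0, 1.0]])
best, value = best_of_finite_set(X, kurtosis_index, np.eye(2))

# Larger candidate set drawn at random.
A = random_unit_directions(p=2, m=100, rng=np.random.default_rng(0))
```

With the canonical basis the first axis wins with a value of 16. The cost is one index evaluation per candidate, which is why the coverage of the sphere by $A$ is the limiting factor.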

New optimization proposals

A new strategy

Find several local optima directly

Run local optimization algorithms several times.

Use heuristic algorithms for local optimization (no regularity condition needed and better exploration of the space of solutions).

No need for global optimization and structure removal.

Difficulty: it is not easy to know the extent to which a new view reflects a similar or a different structure compared with the previous views (Friedman in Jones and Sibson, 1987, discussion) ⇒ exploratory tools are needed to analyze the different projections.
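One very simple way to compare the views returned by several runs, offered here only as an illustrative assumption and not as the exploratory tools the authors have in mind, is the absolute cosine between projection vectors: a value close to 1 means that two runs found essentially the same direction (up to sign), a value close to 0 means orthogonal views.

```python
import numpy as np

def direction_similarity(directions):
    """Matrix of absolute cosines between projection vectors (sign is irrelevant)."""
    A = np.asarray(directions, dtype=float)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    return np.abs(A @ A.T)

# Three hypothetical optima: the first two differ only by their sign.
S = direction_similarity([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
```

Two different directions can still reveal the same outliers, so this matrix is at best a first screening before looking at the projected data.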

Heuristics optimization algorithms

Two families of heuristic optimization methods (Gilli and Winker, 2008):

the trajectory methods (e.g. simulated annealing or Tabu search), which consider one single solution at a time,

the population-based methods (e.g. genetic algorithms or Particle Swarm Optimization), which update a whole set of solutions simultaneously. Focus on this second family of methods (exploration of the whole search space is sometimes more efficient).

Main characteristics:

Heuristic optimization methods can tackle optimization problems that are not tractable with classical optimization tools.

They usually mimic some behavior found in nature.

Implementation of GA, PSO and Tribes (adaptive PSO algorithm).

GA

Flowchart: Initialization → Evaluation → Selection → Crossover and Mutation → Stop? (yes: End; no: loop again)

Model

An individual represents a projection vector

The fitness function is the projection index

Random initialization

Tournament selection with 3 participants

Probability of mutation = 0.05

Probability of crossover = 0.65

Stopping criterion: number of iterations
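The GA loop described on this slide can be sketched as follows. The tournament size (3) and the mutation and crossover probabilities (0.05 and 0.65) come from the slide; everything else is an assumption made for the illustration: the function names, the population size, the arithmetic crossover, the Gaussian mutation with scale 0.3, the renormalization of individuals to unit norm and the kurtosis index used as fitness. It is not the EPP-Lab implementation.

```python
import numpy as np

def kurtosis_index(X, alpha):
    return float(np.sum((X @ alpha) ** 4))

def unit_rows(A):
    """Rescale every row to unit norm so that each individual satisfies alpha' alpha = 1."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)

def ga_pursuit(X, index, pop_size=30, n_iter=50, p_cross=0.65, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = unit_rows(rng.normal(size=(pop_size, p)))        # random initialization
    best, best_fit = None, -np.inf
    for _ in range(n_iter):                                # stop: number of iterations
        fit = np.array([index(X, a) for a in pop])         # evaluation
        k = int(np.argmax(fit))
        if fit[k] > best_fit:                              # remember the best individual seen
            best, best_fit = pop[k].copy(), float(fit[k])
        # Tournament selection with 3 participants per slot.
        contenders = rng.integers(pop_size, size=(pop_size, 3))
        winners = contenders[np.arange(pop_size), np.argmax(fit[contenders], axis=1)]
        parents = pop[winners]
        children = parents.copy()
        # Arithmetic crossover on consecutive pairs (assumed operator).
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                w = rng.random()
                children[i] = w * parents[i] + (1.0 - w) * parents[i + 1]
                children[i + 1] = (1.0 - w) * parents[i] + w * parents[i + 1]
        # Gaussian mutation, applied to each coordinate with probability p_mut (assumed operator).
        mask = rng.random(children.shape) < p_mut
        children = children + mask * rng.normal(scale=0.3, size=children.shape)
        pop = unit_rows(children)
    return best, best_fit

# Hypothetical data set.
X = np.random.default_rng(42).normal(size=(100, 4))
best, best_fit = ga_pursuit(X, kurtosis_index)
```

Running the function with several seeds gives several candidate projections, in the spirit of the strategy based on repeated local searches.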

PSO

Particle Swarm Optimization: Kennedy and Eberhart (1995)

Stochastic method

Biological inspiration (fish schooling and bird flocking)

Each bird seems to move randomly
The communication between birds is limited
However, a swarm is able to find food

PSO: strategy of displacement of a particle

Generation of a swarm of particles

A fitness is associated with each particle

Particles move according to their own experience and that of the swarm

Convergence made possible by the cooperation between particles

PSO Algorithm

Flowchart: Initialization → Update velocity → Update position → Update memory → Stop? (yes: End; no: loop again)

Model

A particle represents a projection vector

The fitness function is the projection index

Parameters of particle $i$:

$\vec{x}_i$: position
$\vec{v}_i$: velocity
$\overrightarrow{pbest}_i$: best solution found by the particle
$\overrightarrow{gbest}$: best solution of the swarm

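Random initialization

Velocity equation:

$$\vec{v}_i \leftarrow \omega \cdot \vec{v}_i + c_1 \cdot \vec{r}_1 \otimes (\overrightarrow{pbest}_i - \vec{x}_i) + c_2 \cdot \vec{r}_2 \otimes (\overrightarrow{gbest} - \vec{x}_i)$$

where $\omega$, $c_1$, $c_2$ are parameters and $\vec{r}_1$, $\vec{r}_2$ are vectors of random values.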

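Position equation:

$$\vec{x}_i \leftarrow \vec{x}_i + \vec{v}_i$$

Update memory: $I$ is the projection index; update $\overrightarrow{pbest}_i$ and $\overrightarrow{gbest}$

Stopping criterion: number of iterations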
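The velocity, position and memory updates of the PSO algorithm can be put together in a short sketch. The update equations follow the slides; the parameter values ($\omega = 0.72$, $c_1 = c_2 = 1.49$), the swarm size, the uniform random vectors, the renormalization of positions to unit norm and the kurtosis index are assumptions chosen for the illustration, and the function names are hypothetical.

```python
import numpy as np

def kurtosis_index(X, alpha):
    return float(np.sum((X @ alpha) ** 4))

def pso_pursuit(X, index, n_particles=20, n_iter=100,
                omega=0.72, c1=1.49, c2=1.49, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Random initialization of the positions on the unit sphere, zero velocities.
    x = rng.normal(size=(n_particles, p))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_fit = np.array([index(X, a) for a in x])
    g = int(np.argmax(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), float(pbest_fit[g])
    for _ in range(n_iter):                        # stop: number of iterations
        r1 = rng.random(x.shape)                   # random values in [0, 1)
        r2 = rng.random(x.shape)
        # Velocity equation (element-wise products play the role of the ⊗ operator).
        v = omega * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        # Position equation, followed by a renormalization (assumption) to keep alpha' alpha = 1.
        x = x + v
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        # Memory update: pbest of each particle, then gbest of the swarm.
        fit = np.array([index(X, a) for a in x])
        improved = fit > pbest_fit
        pbest[improved] = x[improved]
        pbest_fit[improved] = fit[improved]
        g = int(np.argmax(pbest_fit))
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), float(pbest_fit[g])
    return gbest, gbest_fit

# Hypothetical data set.
X = np.random.default_rng(7).normal(size=(100, 4))
gbest, gbest_fit = pso_pursuit(X, kurtosis_index)
```

Because the memory is only overwritten by strictly better positions, `gbest_fit` never decreases from one iteration to the next.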

Tribes method

GA and PSO compared in Berro et al. (2010). Results quite similar but several parameters to tune.

Tribes (Clerc, 2006) is the first parameter-free particle swarm optimization algorithm. It is an adaptive algorithm.

Principle:

Swarm divided into “tribes”
At the beginning, the swarm is composed of only one particle
According to the tribes’ behaviors, particles are added to or removed from the tribes
According to the performances of the particles, their strategies of displacement are adapted

The Tribes method

Tribes has been compared with the usual PSO in Larabi et al. (2010). Tribes is more interesting in the EPP context for two reasons:

No parameter to set except the stopping criterion,

It converges very quickly to local optima.

Illustration with EPP-Lab

The interface EPP-Lab

Implemented in Java (A. Berro, S. Larabi, E. Chabbert, I. Griffith).

Contents

Indices for outlier detection: Friedman-Tukey, Friedman and kurtosis.

Optimization algorithms: GA, PSO and Tribes.

EPP-Lab in action

First step in the analysis: the pursuit (may be time-consuming, but the statistician does not need to sit in front of the computer, unlike with GGobi),

Second step: starting from the projections saved in the first step, analysis of the results (immediate).

The interface EPP-Lab in action

The file don3.txt is 200 × 8 with:

the first 190 observations follow a $\mathcal{N}(0, I_8)$ distribution

and the last 10 follow a $\mathcal{N}((10, 0, \ldots, 0)', I_8)$ distribution.

Observations 191 to 200 are outlying.
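A data set with the same design can be simulated as follows. This does not reproduce the actual file don3.txt: the seed and the variable names are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2010)            # arbitrary seed
clean = rng.normal(size=(190, 8))            # 190 observations from N(0, I_8)
shift = np.zeros(8)
shift[0] = 10.0                              # mean vector (10, 0, ..., 0)'
outliers = rng.normal(size=(10, 8)) + shift  # 10 observations from N((10, 0, ..., 0)', I_8)
X = np.vstack([clean, outliers])             # rows 191 to 200 (1-based) are the outliers

# Projection onto the first coordinate axis, where the two groups differ.
z = X @ np.eye(8)[0]
```

The outliers differ from the bulk only along the first variable, so a successful pursuit should return a direction close to the first coordinate axis (up to sign).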

Conclusion and perspectives

Conclusion

Heuristic algorithms are interesting in the context of EPP with a new strategy based on local optimization.
Possibility of using parameter-free algorithms such as Tribes.

Perspectives

Interface EPP-Lab to improve:

study and implement several stopping criteria,
implement Stahel-Donoho index,
implement robust selection,
develop and implement statistical tools to summarize the different projections (clustering of variables, principal components analysis, sum of projectors, . . . )

Introduce multiobjective optimization in order to deal with spatial data sets.

. . .

Bibliography

Berro, A., Larabi Marie-Sainte, S. and Ruiz-Gazen, A. Genetic Algorithms and Particle Swarm Optimization for Exploratory Projection Pursuit. Annals of Mathematics and Artificial Intelligence, to appear, 2010.

Larabi Marie-Sainte, S., Berro, A. and Ruiz-Gazen, A. An Efficient Optimization Method for Revealing Local Optima of Projection Pursuit Indices. ANTS 2010, Seventh International Conference on Swarm Intelligence, 2010.

Caussinus, H. and Ruiz-Gazen, A. Exploratory Projection Pursuit. In Data Analysis (Digital Signal and Image Processing series), Ed. G. Govaert, ISTE, 2009.

Thank you for your attention!
