on the mining of numerical data with formal concept analysis

On the Mining of Numerical Data with Formal Concept Analysis. PhD thesis in computer science (Thèse de doctorat en informatique). Mehdi Kaytoue, 22 April 2011. Amedeo Napoli, Sébastien Duplessis.

Upload: insa-de-lyon

Post on 29-Nov-2014


DESCRIPTION

PhD dissertation talk, 22 April 2011. The main topic of this thesis is the mining of numerical data, and especially gene expression data. These data characterize the behaviour of thousands of genes in various biological situations (time, cell, etc.). A difficult task consists in clustering genes to obtain classes of genes with similar behaviour, which are supposed to be involved together in a biological process. Accordingly, we are interested in designing and comparing methods for knowledge discovery from biological data. We study how the conceptual classification method called Formal Concept Analysis (FCA) can handle the problem of extracting interesting classes of genes. For this purpose, we have designed and experimented with several original methods based on an extension of FCA called pattern structures. Furthermore, we show that these methods can enhance decision making in agronomy and crop sanity in the vast formal domain of information fusion.

TRANSCRIPT

Page 1: On the Mining of Numerical Data with Formal Concept Analysis

On the Mining of Numerical Data with Formal Concept Analysis

PhD thesis in computer science

Mehdi Kaytoue

22 April 2011

Amedeo Napoli, Sébastien Duplessis

Page 2: On the Mining of Numerical Data with Formal Concept Analysis

Somewhere... in a temperate forest...


Page 3: On the Mining of Numerical Data with Formal Concept Analysis

Context

A biological problem: how does symbiosis work at the cellular level?

Analyse biological processes

Find genes involved in symbiosis

Choose a model for understanding symbiosis: Laccaria bicolor

Analysing Gene Expression Data (GED)

F. Martin et al. The Genome of Laccaria bicolor Provides Insights into Mycorrhizal Symbiosis. In Nature, 2008.


Page 4: On the Mining of Numerical Data with Formal Concept Analysis

Context

Gene expression data (GED)

A numerical dataset, or data-table with

genes in rows

biological situations in columns

each cell gives the expression value of the gene in the row for the situation in the column.

A row denotes the expression profile of a gene (GEP)

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5

Biological hypothesis

A group of genes having a similar expression profile interact together within the same biological process


Page 5: On the Mining of Numerical Data with Formal Concept Analysis

Context

With very large datasets...

Gene expression data of Laccaria bicolor

22,294 genes

3 types of biological situations reflecting cells of the organism in various stages of its biological cycle:

free living mycelium

symbiotic tissues

fruiting bodies

Attribute values range over [0, 65000]


Page 6: On the Mining of Numerical Data with Formal Concept Analysis

Context

Knowledge discovery in databases

An iterative and interactive process

U. Fayyad, G. Piatetsky-Shapiro and P. Smyth. The KDD Process for Extracting Useful Knowledge from Volumes of Data. In Commun. ACM, 1996.


Page 7: On the Mining of Numerical Data with Formal Concept Analysis

Context

Mining gene expression data

Extracting (maximal) rectangles in numerical data

A set of genes co-expressed in some biological situations

Local patterns: biological processes may be activated in some situations only

Overlapping patterns: a gene may be involved in several biological processes

      m1   m2   m3   m4   m5
g1     1    2    2    1    6
g2     2    1    1    0    6
g3     2    2    1    7    6
g4     8    9    2    6    7

Biclustering: a difficult problem relying on heuristics

R. Peeters. The Maximum Edge Biclique Problem is NP-Complete. In Discrete Applied Math., vol. 131, no. 3, 2003.


Page 8: On the Mining of Numerical Data with Formal Concept Analysis

Context

Core of the thesis

Mining gene expression data with formal concept analysis

Turning GED into binary, encoding over/under expression

Bringing the problem into well-known settings

Allowing a complete and mathematically well defined approach

Exploiting algorithms and “tools”

Numerical table:

      m1   m2   m3   m4   m5
g1     1    2    2    1    6
g2     2    1    1    5    6
g3     2    2    1    7    6
g4     8    9    2    6    7

Binary table:

      m1   m2   m3   m4   m5
g1     0    0    0    0    1
g2     0    0    0    0    1
g3     0    0    0    1    1
g4     1    1    0    1    1

Can we work with FCA directly on numerical data?
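The binarization step on this slide can be sketched as follows. This is only an illustration, assuming a single global cut-off: the slides do not state the encoding rule, and the value 6 for `THRESHOLD` is a hypothetical choice made because it reproduces the binary table shown. The names `ged`, `binarize` and `THRESHOLD` are illustrative.

```python
# Expression table from the slide: genes in rows, situations m1..m5 in columns.
ged = {
    "g1": [1, 2, 2, 1, 6],
    "g2": [2, 1, 1, 5, 6],
    "g3": [2, 2, 1, 7, 6],
    "g4": [8, 9, 2, 6, 7],
}

# Assumption: values greater than or equal to 6 count as over-expressed.
THRESHOLD = 6


def binarize(table, threshold):
    """Return a 0/1 table: 1 where the value reaches the threshold."""
    return {gene: [int(v >= threshold) for v in row] for gene, row in table.items()}


binary = binarize(ged, THRESHOLD)
for gene, row in binary.items():
    print(gene, row)
```

A different cut-off gives a different binary table, which is the loss of information that motivates working on the numbers directly.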


Page 10: On the Mining of Numerical Data with Formal Concept Analysis

Context

Outline

1 Context

2 Formal Concept Analysis

3 Contributions

Interval pattern structures

Introducing similarity

A KDD-oriented discussion

4 Conclusion and perspectives


Page 11: On the Mining of Numerical Data with Formal Concept Analysis

Formal Concept Analysis

A binary table as a formal context

Given by (G, M, I) with

G a set of objects

M a set of attributes

I a binary relation between objects and attributes: (g, m) ∈ I means that “object g owns attribute m”

      m1   m2   m3
g1    ×         ×
g2    ×    ×
g3         ×    ×
g4         ×    ×
g5    ×    ×    ×

G = {g1, ..., g5}

M = {m1, m2, m3}

(g1, m3) ∈ I

B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer, 1999.


Page 12: On the Mining of Numerical Data with Formal Concept Analysis

Formal Concept Analysis

A maximal rectangle as a formal concept

A Galois connection to characterize formal concepts

A′ = {m ∈ M | ∀g ∈ A : (g, m) ∈ I} for A ⊆ G

B′ = {g ∈ G | ∀m ∈ B : (g, m) ∈ I} for B ⊆ M

(A, B) is a concept with extent A = B′ and intent B = A′

{g3}′ = {m2, m3}

{m2, m3}′ = {g3, g4, g5}

      m1   m2   m3
g1    ×         ×
g2    ×    ×
g3         ×    ×
g4         ×    ×
g5    ×    ×    ×

({g3, g4, g5}, {m2,m3}) is a formal concept
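The two derivation operators can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the incidence relation in `context` is inferred from the examples given on these slides, and the function names `intent`, `extent` and `is_concept` are illustrative.

```python
# Formal context: each object mapped to the set of attributes it owns.
# Assumption: crosses placed so that the slide's worked examples hold.
context = {
    "g1": {"m1", "m3"},
    "g2": {"m1", "m2"},
    "g3": {"m2", "m3"},
    "g4": {"m2", "m3"},
    "g5": {"m1", "m2", "m3"},
}
ATTRIBUTES = {"m1", "m2", "m3"}


def intent(objects):
    """A' : the attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= context[g]
    return shared


def extent(attributes):
    """B' : the objects that own every attribute in `attributes`."""
    return {g for g, owned in context.items() if attributes <= owned}


def is_concept(objects, attributes):
    """(A, B) is a formal concept when A' = B and B' = A."""
    return intent(objects) == attributes and extent(attributes) == objects


print(intent({"g3"}))
print(extent({"m2", "m3"}))
print(is_concept({"g3", "g4", "g5"}, {"m2", "m3"}))
```

Applying `extent(intent(A))` to any set of objects A yields the smallest extent containing A, which is why concepts correspond to maximal rectangles of crosses.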


Page 13: On the Mining of Numerical Data with Formal Concept Analysis

Formal Concept Analysis

Concept lattice

Ordered set of concepts...

(A1, B1) ≤ (A2, B2) ⇔ A1 ⊆ A2 (⇔ B2 ⊆ B1)

({g1, g5}, {m1,m3}) ≤ ({g1, g2, g5}, {m1})

... with interesting properties

Maximality of concepts as rectangles

Overlapping of concepts

Specialization/generalisation hierarchy

Synthetic representation of the data without loss of information


Page 14: On the Mining of Numerical Data with Formal Concept Analysis

Formal Concept Analysis

Handling numerical data with FCA?

Initial problem

Extracting groups of genes with similar numerical values

Conceptual scaling (discretization or binarization)

An object has an attribute if its value lies in a predefined interval

Numerical table:

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5

Scaled table:

      m1, [4, 5]   m2, [4, 7]   m3, [5, 6]
g1        ×            ×            ×
g2
g3        ×                         ×
g4        ×
g5        ×                         ×

Different scalings: different interpretations of the data

General problem of the thesis

How to directly build a concept lattice from numerical data?
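The conceptual scaling shown on this slide can be sketched as below. The intervals are the ones printed on the slide; the names `scale`, `scale_context` and `scaled_context` are illustrative, and the sketch assumes closed intervals.

```python
# Numerical table from the slide: one tuple (m1, m2, m3) per gene.
data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}
NAMES = ("m1", "m2", "m3")

# One predefined (closed) interval per attribute, as on the slide.
scale = {"m1": (4, 5), "m2": (4, 7), "m3": (5, 6)}


def scale_context(table, intervals):
    """An object gets attribute m when its value for m lies in m's interval."""
    scaled = {}
    for g, values in table.items():
        scaled[g] = {
            m for m, v in zip(NAMES, values)
            if intervals[m][0] <= v <= intervals[m][1]
        }
    return scaled


scaled_context = scale_context(data, scale)
for g in sorted(scaled_context):
    print(g, sorted(scaled_context[g]))
```

Changing the intervals in `scale` changes the resulting binary context, and hence the lattice, which is the interpretation issue raised on the slide.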


Page 16: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Interval pattern structures

How to handle complex descriptions

An intersection as a similarity operator

∩ behaves as similarity operator

{m1,m2} ∩ {m1,m3} = {m1}

∩ induces an ordering relation ⊆

N ∩ O = N ⇐⇒ N ⊆ O

{m1} ∩ {m1, m2} = {m1} ⇐⇒ {m1} ⊆ {m1, m2}

∩ has the properties of a meet ⊓ in a semi-lattice: a commutative, associative and idempotent operation

c ⊓ d = c ⇐⇒ c ⊑ d

A. Tversky. Features of Similarity. In Psychological Review, 84 (4), 1977.


Page 17: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Interval pattern structures

Pattern structure

Given by (G, (D, ⊓), δ)

G a set of objects

(D, ⊓) a semi-lattice of descriptions or patterns

δ a mapping such that δ(g) ∈ D describes object g

A Galois connection

A□ = ⨅_{g ∈ A} δ(g) for A ⊆ G

d□ = {g ∈ G | d ⊑ δ(g)} for d ∈ (D, ⊓)

B. Ganter and S. O. Kuznetsov. Pattern Structures and their Projections. In International Conference on Conceptual Structures, 2001.


Page 18: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Interval pattern structures

Numerical data are pattern structures

Interval pattern structures

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5

{g1, g2}□ = ⨅_{g ∈ {g1, g2}} δ(g)

= 〈5, 7, 6〉 ⊓ 〈6, 8, 4〉

= 〈[5, 6], [7, 8], [4, 6]〉

〈[5, 6], [7, 8], [4, 6]〉□ = {g ∈ G | 〈[5, 6], [7, 8], [4, 6]〉 ⊑ δ(g)}

= {g1, g2, g5}

({g1, g2, g5}, 〈[5, 6], [7, 8], [4, 6]〉) is a (pattern) concept
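The computation above can be reproduced with a short Python sketch. It assumes that the meet of two intervals is their convex hull, which is what the worked example shows; the names `delta`, `meet`, `description` and `image` are illustrative and not taken from the thesis.

```python
from functools import reduce

# Numerical table from the slide: one tuple (m1, m2, m3) per gene.
data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}


def delta(g):
    """Description of an object: each value v becomes the interval [v, v]."""
    return tuple((v, v) for v in data[g])


def meet(c, d):
    """Component-wise convex hull of two interval patterns."""
    return tuple(
        (min(a1, a2), max(b1, b2)) for (a1, b1), (a2, b2) in zip(c, d)
    )


def subsumed(c, d):
    """c is below d in the semi-lattice order when their meet equals c."""
    return meet(c, d) == c


def description(objects):
    """Common description of a non-empty set of objects (first operator)."""
    return reduce(meet, (delta(g) for g in objects))


def image(pattern):
    """Objects whose description lies inside the pattern (second operator)."""
    return {g for g in data if subsumed(pattern, delta(g))}


d = description({"g1", "g2"})
print(d)
print(sorted(image(d)))
```

Note that larger intervals are lower in the order: a pattern subsumes every object whose values fall inside all of its intervals.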


Page 19: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Interval pattern structures

Interval pattern concept lattice

Lowest concepts: few objects, small intervals

Highest concepts: many objects, large intervals

Overwhelming number of concepts and patterns


Page 20: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Interval pattern structures

Links with conceptual scaling

Interordinal scaling [Ganter & Wille]

A scale to encode intervals of attribute values

value   m1 ≤ 4   m1 ≤ 5   m1 ≤ 6   m1 ≥ 4   m1 ≥ 5   m1 ≥ 6
4         ×        ×        ×        ×
5                  ×        ×        ×        ×
6                           ×        ×        ×        ×

Equivalent concept lattice

Example

({g1, g2, g5}, {m1 ≤ 6, m1 ≥ 4, m1 ≥ 5, ... , ... })

({g1, g2, g5}, 〈[5, 6] , ... , ... 〉)

Why use pattern structures when we already have scaling?

Processing a pattern structure is more efficient
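The correspondence between interordinal scaling and intervals can be checked on attribute m1 with the sketch below. It only covers one attribute and is an illustration under that restriction; `interordinal`, `scaled_m1` and `common_attributes` are illustrative names.

```python
# Values of attribute m1 for the five genes of the running example.
m1 = {"g1": 5, "g2": 6, "g3": 4, "g4": 4, "g5": 5}
thresholds = sorted(set(m1.values()))  # [4, 5, 6]


def interordinal(value):
    """Binary attributes 'm1 <= s' and 'm1 >= s' owned by a given value."""
    owned = {f"m1 <= {s}" for s in thresholds if value <= s}
    owned |= {f"m1 >= {s}" for s in thresholds if value >= s}
    return owned


scaled_m1 = {g: interordinal(v) for g, v in m1.items()}


def common_attributes(objects):
    """Intent (restricted to m1) of a non-empty set of objects."""
    return set.intersection(*(scaled_m1[g] for g in objects))


print(sorted(interordinal(4)))
print(sorted(common_attributes({"g1", "g2", "g5"})))
```

The tightest bounds in the intent, `m1 >= 5` and `m1 <= 6`, give the interval [5, 6]; the remaining attribute `m1 >= 4` is implied by them, which hints at the redundancy carried by the binary encoding.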


Page 22: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Introducing similarity

Introducing a similarity relation

How to group objects having similar values in the same concept?

A natural similarity relation on numbers

a ≃_θ b ⇔ |a − b| ≤ θ, e.g. 4 ≃_1 5 but 4 ≄_1 6

Similarity operator ⊓ in pattern structures

[Diagram, shown in successive builds for θ = 2, θ = 1 and θ = 0: meet semi-lattice of the intervals over the values 4, 5 and 6]

4   5   6

[4,5]   [5,6]

[4,6]

How to consider a similarity relation w.r.t. a distance?


Page 27: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Introducing similarity

Towards a similarity between values

Introduce an element ∗ ∈ (D, ⊓) denoting dissimilarity

c ⊓ d = ∗ iff c ≄_θ d

c ⊓ d ≠ ∗ iff c ≃_θ d

Example with θ = 1

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5

{g3, g4}□ = 〈[4, 4], [8, 9], ∗〉

〈[4, 4], [8, 9], ∗〉□ = {g3, g4}

({g3, g4}, 〈[4, 4], [8, 9], ∗〉) is a concept: g3 and g4 have similar values for attributes m1 and m2 only

Is {g3, g4} maximal w.r.t. similarity? We can add g5...
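The example with θ = 1 can be reproduced with the sketch below. It assumes that two intervals are similar when the length of their convex hull is at most θ, which agrees with the slide's example; the string `"*"` stands for the dissimilarity element, and all function names are illustrative.

```python
from functools import reduce

STAR = "*"   # stands for the dissimilarity element of the slide
THETA = 1

data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}


def delta(g):
    """Each value v is described by the interval [v, v]."""
    return tuple((v, v) for v in data[g])


def meet_theta(c, d, theta=THETA):
    """Convex hull per attribute, replaced by STAR when wider than theta."""
    result = []
    for x, y in zip(c, d):
        if x == STAR or y == STAR:
            result.append(STAR)
            continue
        low, high = min(x[0], y[0]), max(x[1], y[1])
        result.append((low, high) if high - low <= theta else STAR)
    return tuple(result)


def description(objects):
    """Common description of a non-empty set of objects."""
    return reduce(meet_theta, (delta(g) for g in objects))


def image(pattern):
    """Objects whose values fall in every non-STAR interval of the pattern."""
    return {
        g for g, row in data.items()
        if all(p == STAR or p[0] <= v <= p[1] for p, v in zip(pattern, row))
    }


d = description({"g3", "g4"})
print(d, sorted(image(d)))
print(description({"g3", "g4", "g5"}))
```

The last line shows why {g3, g4} is not maximal with respect to similarity: adding g5 still yields a description with two non-dissimilar components.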


Page 28: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Introducing similarity

Classes of tolerance in numerical data

Towards maximal sets of similar values

≃_θ is a tolerance relation: reflexive, symmetric, not transitive

Consider an attribute taking values in {6, 8, 11, 16, 17} and θ = 5

8 ≃_5 11, 11 ≃_5 16 but 8 ≄_5 16

A class of tolerance as a maximal set of pairwise similar values

{6, 8, 11}   {11, 16}   {16, 17}

[6, 11]   [11, 16]   [16, 17]

S. O. Kuznetsov. Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research. In Formal Concept Analysis, Foundations and Applications, 2005.
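For numbers, the classes of tolerance can be computed with a sliding window over the sorted values, as in the sketch below. This is one straightforward way to obtain them, not necessarily the procedure used in the thesis; `tolerance_classes` is an illustrative name.

```python
def tolerance_classes(values, theta):
    """Maximal sets of pairwise similar values, where |a - b| <= theta.

    On sorted numbers, a set is pairwise similar exactly when max - min
    is at most theta, so each class is a window of consecutive values.
    """
    vals = sorted(set(values))
    classes = []
    last_end = -1  # index of the last value covered by the previous class
    for start in range(len(vals)):
        end = start
        while end + 1 < len(vals) and vals[end + 1] - vals[start] <= theta:
            end += 1
        # A window is maximal only if it reaches further than the previous one.
        if end > last_end:
            classes.append(vals[start:end + 1])
            last_end = end
    return classes


classes = tolerance_classes([6, 8, 11, 16, 17], theta=5)
print(classes)
print([(c[0], c[-1]) for c in classes])  # the intervals describing each class
```

The value 11 belongs to two classes, which is the overlap that distinguishes a tolerance relation from an equivalence relation.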


Page 29: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Introducing similarity

Tolerance in pattern structures

Projecting the pattern structure

Each value is replaced by the interval characterizing its class of tolerance (if unique)

Each pattern d is projected with a mapping ψ such that ψ(d) ⊑ d (pre-processing)

Example with θ = 1

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5

{g3, g4}□ = ψ(〈[4, 4], [8, 9], ∗〉)

= 〈[4, 5], [8, 9], ∗〉

〈[4, 5], [8, 9], ∗〉□ = {g3, g4, g5}
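The projection can be sketched as a pre-processing of the table followed by the similarity-aware meet. The sketch makes two assumptions that the slide does not spell out: a value belonging to several classes of tolerance is left unchanged, and the image is computed on the projected descriptions. All names are illustrative.

```python
from functools import reduce

STAR = "*"   # stands for the dissimilarity element of the slide
THETA = 1

data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}


def tolerance_classes(values, theta):
    """Maximal windows of sorted values whose spread is at most theta."""
    vals, classes, last_end = sorted(set(values)), [], -1
    for start in range(len(vals)):
        end = start
        while end + 1 < len(vals) and vals[end + 1] - vals[start] <= theta:
            end += 1
        if end > last_end:
            classes.append(vals[start:end + 1])
            last_end = end
    return classes


def project_column(values, theta):
    """Map a value to the interval of its class when that class is unique."""
    classes = tolerance_classes(values, theta)
    mapping = {}
    for v in set(values):
        owners = [c for c in classes if v in c]
        # Assumption: values lying in several classes are kept as [v, v].
        mapping[v] = (owners[0][0], owners[0][-1]) if len(owners) == 1 else (v, v)
    return mapping


maps = [project_column(col, THETA) for col in zip(*data.values())]
projected = {
    g: tuple(maps[i][v] for i, v in enumerate(row)) for g, row in data.items()
}


def meet_theta(c, d, theta=THETA):
    """Convex hull per attribute, replaced by STAR when wider than theta."""
    out = []
    for x, y in zip(c, d):
        if x == STAR or y == STAR:
            out.append(STAR)
            continue
        low, high = min(x[0], y[0]), max(x[1], y[1])
        out.append((low, high) if high - low <= theta else STAR)
    return tuple(out)


def description(objects):
    """Common projected description of a non-empty set of objects."""
    return reduce(meet_theta, (projected[g] for g in objects))


def image(pattern):
    """Objects whose projected intervals lie inside the pattern's intervals."""
    return {
        g for g, desc in projected.items()
        if all(p == STAR or (p[0] <= lo and hi <= p[1])
               for p, (lo, hi) in zip(pattern, desc))
    }


d = description({"g3", "g4"})
print(d, sorted(image(d)))
```

With the projection, the description of {g3, g4} widens from [4, 4] to [4, 5] on m1, so g5 joins the extent as on the slide.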


Page 30: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Introducing similarity

Biological results

An extracted pattern among 2,150 others

Genes present a high expression level in the fruit-body situations

Some of these genes encode metabolic enzymes involved in the remobilization of fungal resources towards the new organ in development

Other genes are unknown but specific to Laccaria bicolor: this requires biological experiments


Page 31: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – Introducing similarity

Relevant publications

Interval pattern structures and GED analysis

M. Kaytoue, S. Duplessis, S. O. Kuznetsov, and A. Napoli. Two FCA-Based Methods for Mining Gene Expression Data. In International Conference on Formal Concept Analysis (ICFCA), 2009.

M. Kaytoue, S. O. Kuznetsov, A. Napoli and S. Duplessis. Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis. In Information Sciences, Spec. Iss.: Lattices (Elsevier), 2011.

Introducing tolerance relations and information fusion

M. Kaytoue, Z. Assaghir, N. Messai and A. Napoli. Two Complementary Classification Methods for Designing a Concept Lattice from Interval Data. In Foundations of Information and Knowledge Systems, 6th International Symposium (FoIKS), 2010.

M. Kaytoue, Z. Assaghir, A. Napoli and S. O. Kuznetsov. Embedding Tolerance Relations in Formal Concept Analysis: an Application in Information Fusion. In ACM Conference on Information and Knowledge Management (CIKM), 2010.


Page 32: On the Mining of Numerical Data with Formal Concept Analysis

Contributions –

Other works

Pattern structures are useful for several tasks

Bi-clustering and tolerance relations

M. Kaytoue, S. O. Kuznetsov, and A. Napoli. Biclustering Numerical Data in Formal Concept Analysis. In International Conference on Formal Concept Analysis (ICFCA), 2011.

Information fusion: enhancing decision making

Z. Assaghir, M. Kaytoue, A. Napoli and H. Prade. Managing Information Fusion with Formal Concept Analysis. In Modeling Decisions for Artificial Intelligence, 6th International Conference (MDAI), 2010.

KDD: a study of equivalence classes of interval patterns

M. Kaytoue, S. O. Kuznetsov, and A. Napoli. Revisiting Numerical Pattern Mining with Formal Concept Analysis. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.


Page 34: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – A KDD-oriented discussion

Interval pattern search space

Counting all possible interval patterns

〈[a_m1, b_m1], [a_m2, b_m2], ...〉 where a_mi, b_mi ∈ W_mi, the set of values taken by attribute mi

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5

∏_{i ∈ {1, ..., |M|}} ( |W_mi| × (|W_mi| + 1) ) / 2

360 possible interval patterns in our small example
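The count can be checked directly from the table: m1 and m2 each take 3 distinct values and m3 takes 4, giving 6 × 6 × 10 = 360. A minimal Python sketch of this arithmetic, with illustrative variable names:

```python
from math import prod

# Numerical table from the slide: one tuple (m1, m2, m3) per gene.
data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}

# Number of distinct values per attribute (one entry per column).
sizes = [len(set(column)) for column in zip(*data.values())]

# An attribute with n distinct values admits n * (n + 1) / 2 intervals [a, b], a <= b.
intervals_per_attribute = [n * (n + 1) // 2 for n in sizes]
total_patterns = prod(intervals_per_attribute)

print(sizes, intervals_per_attribute, total_patterns)
```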


Page 35: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – A KDD-oriented discussion

Semantics for interval patterns

Interval patterns as (hyper) rectangles

      m1   m3
g1     5    6
g2     6    4
g3     4    5
g4     4    8
g5     5    5

〈[4, 5], [5, 6]〉□ = {g1, g3, g5}

〈[4, 5], [5, 7]〉□ = {g1, g3, g5}

〈[4, 6], [5, 6]〉□ = {g1, g3, g5}

〈[4, 5], [4, 6]〉□ = {g1, g3, g5}

〈[4, 6], [5, 7]〉□ = {g1, g3, g5}

〈[4, 5], [4, 7]〉□ = {g1, g3, g5}

[Figure: the points δ(g1), ..., δ(g5) plotted with m1 on the horizontal axis (3 to 6) and m3 on the vertical axis (3 to 8); the patterns above are highlighted one at a time in successive builds]


Page 42: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – A KDD-oriented discussion

A condensed representation

Equivalence classes of interval patterns

Two interval patterns with same image are said to be equivalent

c ≅ d ⇐⇒ c□ = d□

Equivalence class of a pattern d

[d] = {c | c ≅ d}

with a unique closed pattern: the smallest rectangle

and one or several generators: the largest rectangles

Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Expl., 2(2):66–75, 2000.

In our example: 360 patterns ; 18 closed ; 44 generators
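On a table this small, the equivalence classes can be obtained by brute force. The sketch below enumerates every interval pattern, groups patterns by image, and recovers the closed pattern of each class as the bounding box of its image. It is a naive check and not one of the algorithms of the thesis; classes with an empty image are set aside, and generators are not computed. All names are illustrative.

```python
from itertools import combinations_with_replacement, product

# Numerical table from the slide: one tuple (m1, m2, m3) per gene.
data = {
    "g1": (5, 7, 6),
    "g2": (6, 8, 4),
    "g3": (4, 8, 5),
    "g4": (4, 9, 8),
    "g5": (5, 8, 5),
}

# Sorted distinct values of each attribute.
columns = [sorted(set(col)) for col in zip(*data.values())]

# All intervals (a, b) with a <= b and both bounds among the observed values.
intervals = [list(combinations_with_replacement(vals, 2)) for vals in columns]


def image(pattern):
    """Objects whose values fall inside every interval of the pattern."""
    return frozenset(
        g for g, row in data.items()
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, pattern))
    )


# Equivalence classes: patterns grouped by their image.
classes = {}
for pattern in product(*intervals):
    classes.setdefault(image(pattern), []).append(pattern)

n_patterns = sum(len(members) for members in classes.values())
nonempty = {ext: members for ext, members in classes.items() if ext}


def closed_pattern(extent):
    """Smallest rectangle containing the objects of a non-empty extent."""
    rows = [data[g] for g in extent]
    return tuple((min(col), max(col)) for col in zip(*rows))


print(n_patterns, len(nonempty))
print(closed_pattern(frozenset({"g1", "g3", "g5"})))
```

Storing one closed pattern per class instead of every member of the class is the compression that the condensed representation exploits.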


Page 43: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – A KDD-oriented discussion

Algorithms & experiments

Algorithms: MinIntChange, MinIntChangeG[t|h]

4 5 6

[4,5] [5,6]

[4,6]

Experiments

Mining several datasets from Bilkent University Repository

Compression rate varies between 10^7 and 10^9

Interordinal scaling: encodes ≈ 30,000 binary patterns

not efficient even with the best algorithms (e.g. LCMv2)

a redundancy problem rules out its use for generator extraction


Page 45: On the Mining of Numerical Data with Formal Concept Analysis

Contributions – A KDD-oriented discussion

Discussion

Advantages

Minimum description length principle favours generators

Potential applications

Data privacy and k-anonymisation

k-box problem in computational geometry

Quantitative association rule mining

Data summarization

Problem

With very large datasets, compression is not enough

Numerical data are noisy

Need of fault-tolerant condensed representations


Page 47: On the Mining of Numerical Data with Formal Concept Analysis

Conclusion and perspectives

Conclusion

A new insight into the mining of numerical data

Our main tools...

Formal Concept Analysis and conceptual scaling

Pattern structures and projections

Tolerance relation

... for numerical data mining

Conceptual representations of numerical data

Bi-clustering

Information fusion

Applications: GED analysis and agricultural practice assessment


Page 48: On the Mining of Numerical Data with Formal Concept Analysis

Conclusion and perspectives

Conclusion

An application in GED analysis

With FCA and pattern structures

Many ways of extracting patterns in GED

Biological validation of several patterns

We now need a systematic validation step using new knowledge

transcription factors

biological knowledge base, e.g. Gene Ontology


Page 49: On the Mining of Numerical Data with Formal Concept Analysis

Conclusion and perspectives

To be continued...

Short- and mid-term

Handle other types of biclusters and algorithm comparison

S. C. Madeira and A. L. Oliveira. Biclustering Algorithms for Biological Data Analysis: a Survey. In IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004.

Insert domain knowledge for biological data

Study threshold θ effect w.r.t. the number of tolerance classes

Post-doctoral position

Biclustering (multi-dimensional) numerical data

Numerical pattern based classifier and association rules

Data privacy and pattern projection

Wagner Meira Jr. (Universidade Federal de Minas Gerais, Brazil)


Page 50: On the Mining of Numerical Data with Formal Concept Analysis

Conclusion and perspectives

Cross-domain fertilization

Itemset-mining in KDD

Other frameworks for closed patterns

H. Arimura and T. Uno. Polynomial-Delay and Polynomial-Space Algorithms for Mining Closed Sequences, Graphs, and Pictures in Accessible Set Systems. In SIAM International Conference on Data Mining, 2009.

G. C. Garriga. Formal Methods for Mining Structured Objects. PhD Thesis, Universitat Politècnica de Catalunya, 2006.

Condensed representations and fault-tolerant patterns

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5    15    8    5

R. Pensa and J.-F. Boulicaut. Towards Fault-Tolerant Formal Concept Analysis. In Proc. 9th Congress of the Italian Association for Artificial Intelligence (AI*IA), Springer, 2005.


Page 51: On the Mining of Numerical Data with Formal Concept Analysis

Conclusion and perspectives

Cross-domain fertilization

Data-analysis

Symbolic data analysis and distances

P. Agarwal, M. Kaytoue, S. O. Kuznetsov, A. Napoli and G. Polaillon. Symbolic Galois Lattices with Pattern Structures. In International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC), 2011.

Information fusion and fuzzy concept analysis

Fuzzy settings and possibility theory

Z. Assaghir, M. Kaytoue, and H. Prade. A Possibility Theory Oriented Discussion of Conceptual Pattern Structures. In Scalable Uncertainty Management, 4th International Conference (SUM), 2010.


Page 52: On the Mining of Numerical Data with Formal Concept Analysis

Merci

Danke schön

Spasibo
