coclus icdm workshop talk

65
1 1 1,2 1 2

Upload: dmitry-ignatov

Post on 18-Dec-2014

192 views

Category:

Documents


1 download

DESCRIPTION

Some updates on my biclustering for Sponsored Ad since 2008

TRANSCRIPT

Page 1: CoClus ICDM Workshop talk

Concept-based Biclustering for Internet

Advertisement

Dmitry I. Ignatov1 Sergei O. Kuznetsov1

Jonas Poelmans1,2

1Higher School of Economics, Department of Applied Mathematics andInformation Science

2Katholieke Universiteit Leuven

CoClus: ICDM Workshop on Coclustering and Applications

2012, Brussels

1 / 58

Page 2: CoClus ICDM Workshop talk

Outline

1 Problem statement

2 Formal Concept Analysis

3 Concept-based biclustering

4 Data and Experiments

5 Concept-based triclustering

6 Conclusion

2 / 58

Page 3: CoClus ICDM Workshop talk

Aim of research

1 Developing FCA-based biclustering algorithms for Internet

advertising

2 Experimental validation of our approach for Internet

advertising

3 / 58

Page 4: CoClus ICDM Workshop talk

Problem statement

Contextual Internet-advertising

Finding out advertising phrases potentially interesting for

an advertiser

For example: Google AdWords, Yandex.Direct etc.

4 / 58

Page 5: CoClus ICDM Workshop talk

Formal Concept Analysis. Short history.

Logique de Port-Royal (1662)

G. Birkho�, since 1930s

O. Øre, since 1930s

M. Barbut, B. Monjardet, Ordre et classi�cation, Hachette,

Paris, 1970

R. Wille, Restructuring lattice theory: An approach based

on hierarchies of concepts, 1982

B. Ganter, R. Wille, Formale Begri�sanalyse, Springer,

1996

B. Ganter, R. Wille, Formal Concept Analysis, Springer,

1999

Chapter in B. Davey, H. Priestly, Introduction to Order

and Lattices, 1990.

Chapter in last edition of G. Gr�atzer (Ed.), General Lattice

Theory.

Concept Data Analysis, C.Carpineto, G. Romano, 2004. 5 / 58

Page 6: CoClus ICDM Workshop talk

Main FCA conferences

International Conference on Conceptual Structures (ICCS): since 1996section on FCA (proceedings in LNAI, Springer)

International Conference on Formal Concept Analysis (ICFCA): since 2003(proceedings in LNAI, Springer)

International Conference on Concept Lattices and Their Applications(CLA): since 2006, local proceedings, with subsequent special issue

6 / 58

Page 7: CoClus ICDM Workshop talk

Formal Concept Analysis[R.Wille, 1982], [B.Ganter and R.Wille, 1999]

M , a set of attributes

G, a set of objects

relation I ⊆ G×M such that (g,m) ∈ I if and only if the object g hasthe attribute m.

K := (G,M, I) is a formal context.

Derivation operators: A ⊆ G, B ⊆M

A′def= {m ∈M | gIm for all g ∈ A}, B′ def= {g ∈ G | gIm for all m ∈ B}

A formal concept is a pair (A,B): A ⊆ G, B ⊆M, A′ = B, and B′ = A.

- A is the extent and B is the intent of the concept (A,B).

- The concepts, ordered by (A1, B1) ≥ (A2, B2) ⇐⇒ A1 ⊇ A2

form a complete lattice, called the concept lattice B(G,M, I).

7 / 58

Page 8: CoClus ICDM Workshop talk

Example of context of geometrical �gures and its concept lattice

({1,2,3,4},∅)

({1,4},{d}) ({2,3,4},{c}) ({1,2},{a})

({3,4},{b,c})

({1},{a,d})

(∅,M)

({4},{b,c,d}) ({2},{a,c})

G \ M a b c d

1 × ×

2 × ×

3 × ×

4 × × ×

a � has exactly 3 vertices,

b � has exactly 4 vertices,

c � has a right angle,

d � is equilateral8 / 58

Page 9: CoClus ICDM Workshop talk

Implications

Implication A→ B for A,B ⊆M holds if A′ ⊆ B′, i.e., every object thathas all attributes from A also has all attributes from B.

Armstrong rules:

A→ B

A ∪ C → B,

A→ B,A→ C

A→ B ∪ C,

A→ B,B → C

A→ C

A Minimal implication base:A base with the minimum number of implications [Duquenne, Guigues 1986]orthe stem base, its premises can be given (Ganter 1987) by pseudointents:

A set P ⊆M is a pseudointent if

P 6= P ′′ andQ′′ ⊂ P for every pseudointent Q ⊂ P .

9 / 58

Page 10: CoClus ICDM Workshop talk

Partial implications or association rules[Luxenburger M., 1991], [Agrawal R. et al., 1993]

Luxenburger M. Implications partielles dans un contexte. Math�ematiques, Informatiqueet Sciences Humaines, 113 (29) : 35-55, 1991.

Agrawal R., Imielinski T., Swami A. Mining association rules between sets of items inlarge databases, Proceedings, ACM SIGMOD Conference on Management of Data,Washington D.C., pp. 207-216, 1993.

Let K = (G,M, I) be a formal context.

De�nition 1

Association rule of the context K is an attribute dependency A→ B, whereA,B ⊆M .

De�nition 2

Support of the association rules A→ B is supp(A→ B) = |(A∪B)′||G| .

De�nition 3

Con�dence of the association rule A→ B is conf(A→ B) = |(A∪B)′||A′| .

10 / 58

Page 11: CoClus ICDM Workshop talk

Biclustering[Mirkin, 1995]

Coinage the term bicluster

The term bicluster(ing) was proposed by B. Mirkin in the book

Mathematical Classi�cation and Clustering. Kluwer Academic

Publishers (1996).

p. 296

The term biclustering refers to simultaneous clustering

of both row and column sets in a data matrix.

Biclustering addresses the problems of aggregate

representation of the basic features of interrelation

between rows and columns as expressed in the data.

11 / 58

Page 12: CoClus ICDM Workshop talk

Biclustering: general de�nition[Madeira et al., 2004]

Let An×m be a matrix, numeric or boolean

X = {x1, x2, . . . , xn} is a set of rowsY = {y1, y2, . . . , yn} is a set of columnsI ⊆ X and J ⊆ Y are subsets of rows and columns

AIJ = (I, J) is a submatrix of A

Cluster of rows AIY = (I, Y )

Cluster of columns AXJ = (X, J)

Bicluster is a submatrix of A in the form AIJ = (I, J)

B = {Bk = (Ik, Jk)} is a set of biclusters

12 / 58

Page 13: CoClus ICDM Workshop talk

Biclustering methods[S. Madeira et al., 2004]

Biclustering methods in Bioinformatics in 2004

Algorithm Bicluster type Structure Generation Search strategyBlock Clusters Constant f One Set at a Time Div-and-Conqδ-biclusters Coherent Values i One at a Time Greedy

FLOC Coherent Values i Simultaneous GreedypClusters Coherent Values g Simultaneous Exh-Enum

Plaid Models Coherent Values i One at a Time Dist-IdentPRMs Coherent Values i Simultaneous Dist-IdentCTWC Constant Columns i One Set at a Time Clust-CombITWC Coherent Values d,e One Set at a Time Clust-CombDCC Constant b,c Simultaneous Clust-Comb

δ-Patterns Constant Rows i Simultaneous GreedySpectral Coherent Values c Simultaneous GreedyGibbs Constant Columns d,e One at a Time Dist-IdentOPSMs Coherent Evolution a,i One at a Time GreedySAMBA Coherent Evolution i Simultaneous Exh-EnumxMOTIFs Coherent Evolution a,i Simultaneous Greedy

OP-Clusters Coherent Evolution i Simultaneous Exh-Enum

13 / 58

Page 14: CoClus ICDM Workshop talk

BiMax algorithm: rediscovery formal concepts in Bioinformatics

De�nition [S. Barkow et al, 2006]

Given m genes, n situations and a binary table e such that eij = 1 (gene iis active in situation j) or eij = 0 (gene i is not active in situation j) for alli ∈ [1,m] and j ∈ [1, n], the pair (G,C) ∈ 2{1,...,n} × 2{1,...,m} is called aninclusion-maximal bicluster if and only if (1) ∀i ∈ G, j ∈ C : eij = 1 and (2)@(G1, C1) ∈ 2{1,...,n} × 2{1,...,m} with (a) ∀i1 ∈ G1, ∀j1 ∈ C1: ei1j1 = 1 and(b) G ⊆ G1 ∧ C ⊆ C1 ∧ (G1, C1) 6= (G,C).

Denote by H the set of genes (objects in general), by S the set of situations(attributes in general), and by E ⊆ H × S the binary relation given by thebinary table e, |H| = m, |S| = n.

Proposition [Kuznetsov et al., 2009]

For every pair (G,C), G ⊆ H, C ⊆ S the following two statements areequivalent.1. (G,C) is an inclusion-maximal bicluster of the table e;2. (G,C) is a formal concept of the context (H,S,E).

14 / 58

Page 15: CoClus ICDM Workshop talk

BiMax algorithm: rediscovery formal concepts in Bioinformatics

De�nition [S. Barkow et al, 2006]

Given m genes, n situations and a binary table e such that eij = 1 (gene iis active in situation j) or eij = 0 (gene i is not active in situation j) for alli ∈ [1,m] and j ∈ [1, n], the pair (G,C) ∈ 2{1,...,n} × 2{1,...,m} is called aninclusion-maximal bicluster if and only if (1) ∀i ∈ G, j ∈ C : eij = 1 and (2)@(G1, C1) ∈ 2{1,...,n} × 2{1,...,m} with (a) ∀i1 ∈ G1, ∀j1 ∈ C1: ei1j1 = 1 and(b) G ⊆ G1 ∧ C ⊆ C1 ∧ (G1, C1) 6= (G,C).

Denote by H the set of genes (objects in general), by S the set of situations(attributes in general), and by E ⊆ H × S the binary relation given by thebinary table e, |H| = m, |S| = n.

Proposition [Kuznetsov et al., 2009]

For every pair (G,C), G ⊆ H, C ⊆ S the following two statements areequivalent.1. (G,C) is an inclusion-maximal bicluster of the table e;2. (G,C) is a formal concept of the context (H,S,E).

14 / 58

Page 16: CoClus ICDM Workshop talk

Concept-based biclustering[D. Ignatov and S. Kuznetsov, 2010]

Let K = (G,M, I ⊆ G×M) be a formal context.

De�nition 1

If (g,m) ∈ I, then (m′, g′) is called an object-attribute or

OA-bicluster with density ρ(m′, g′) = |I∩(m′×g′)||m′|·|g′| .

15 / 58

Page 17: CoClus ICDM Workshop talk

Geometric interpretation of OA-bicluster[D. Ignatov and S. Kuznetsov, 2010]

g

m

g''

m''

g'

m'

16 / 58

Page 18: CoClus ICDM Workshop talk

OA-biclustering properties[D. Ignatov and S. Kuznetsov, 2010]

Properties

1 0 ≤ ρ ≤ 1.

2 OA-bicluster (m′, g′) is a formal concept i� ρ = 1.

3 if (m′, g′) is an OA-bicluster, then (g′′, g′) ≤ (m′,m′′).

17 / 58

Page 19: CoClus ICDM Workshop talk

OA-biclustering properties[D. Ignatov and S. Kuznetsov, 2010]

Property

The constraint ρ(A,B) ≥ ρmin is neither monotonic nor

anti-monotonic w.r.t. v relation.

If ρmin = 0, this means that we consider the set of allOA-biclusters of the context K.For ρmin = 0 every formal concept is �contained� in a

OA-bicluster of the context K.Proposition

For each (Ac, Bc) ∈ B(G,M, I) there exists a OA-bicluster(Ab, Bb) ∈ B such that (Ac, Bc) v (Ab, Bb).

18 / 58

Page 20: CoClus ICDM Workshop talk

OA-biclustering properties[D. Ignatov and S. Kuznetsov, 2010]

Property

The constraint ρ(A,B) ≥ ρmin is neither monotonic nor

anti-monotonic w.r.t. v relation.

If ρmin = 0, this means that we consider the set of allOA-biclusters of the context K.

For ρmin = 0 every formal concept is �contained� in a

OA-bicluster of the context K.Proposition

For each (Ac, Bc) ∈ B(G,M, I) there exists a OA-bicluster(Ab, Bb) ∈ B such that (Ac, Bc) v (Ab, Bb).

18 / 58

Page 21: CoClus ICDM Workshop talk

OA-biclustering properties[D. Ignatov and S. Kuznetsov, 2010]

Property

The constraint ρ(A,B) ≥ ρmin is neither monotonic nor

anti-monotonic w.r.t. v relation.

If ρmin = 0, this means that we consider the set of allOA-biclusters of the context K.For ρmin = 0 every formal concept is �contained� in a

OA-bicluster of the context K.Proposition

For each (Ac, Bc) ∈ B(G,M, I) there exists a OA-bicluster(Ab, Bb) ∈ B such that (Ac, Bc) v (Ab, Bb).

18 / 58

Page 22: CoClus ICDM Workshop talk

OA-biclustering Algorithm Complexity[D. Ignatov and S. Kuznetsov, 2010]

Proposition 1

For a given formal context K = (G,M, I) and ρmin = 0 the

largest number of OA-biclusters is equal to |I|, all OA-biclusterscan be generated in time O(|I| · (|G|+ |M |)).

Proposition 2

For a given formal context K = (G,M, I) and ρmin > 0 the

largest number of OA-biclusters is equal to |I|, all OA-biclusterscan be generated in time O(|I| · |G| · |M |).

OA-biclustering VS FCA

O(|I| · |G| · |M |) VS O(|L| · |G|2 · |M |)

19 / 58

Page 23: CoClus ICDM Workshop talk

OA-biclustering Algorithm Complexity[D. Ignatov and S. Kuznetsov, 2010]

Proposition 1

For a given formal context K = (G,M, I) and ρmin = 0 the

largest number of OA-biclusters is equal to |I|, all OA-biclusterscan be generated in time O(|I| · (|G|+ |M |)).

Proposition 2

For a given formal context K = (G,M, I) and ρmin > 0 the

largest number of OA-biclusters is equal to |I|, all OA-biclusterscan be generated in time O(|I| · |G| · |M |).

OA-biclustering VS FCA

O(|I| · |G| · |M |) VS O(|L| · |G|2 · |M |)

19 / 58

Page 24: CoClus ICDM Workshop talk

Recommendation of an advertisement phrase

Input data

Data on purchases of bids (advertisement phrases), the formal contextKFT = (F, T, IFT ), where F is the set of advertisers, T � the set of bids,fIt denotes that advertiser f ∈ F bought bid t ∈ T . The size of the contextis 2000× 3000.

Problem statement

It is required to extract bid markets for further recommendations.

Approaches

FCA, D-miner algorithm

OA-biclustering

search for association rules

Morphology-based metarules

Ontology-based metarules

20 / 58

Page 25: CoClus ICDM Workshop talk

Detecting large market sectors with D-miner

[Besson et al, 2004], D-miner, O(|G|2|M ||L|)

Глава 4. Проведение экспериментов Основные эксперименты проводились на наборе данных компании Overture (ныне часть Yahoo Inc.). Подготовленные для анализа данные представляют собой объектно-признаковую таблицу. Объекты соответствуют фирмам, а признаки рекламным словам и словосочетаниям, которые такие компании приобретают для продвижения своих товаров через Интернет. Число фирм – 2000, слов – 3000. Целью экспериментов являлось выявле-ние методов, которые способны обнаружить соответствующие бикластеры в исходных данных за разумное время. Сами бикластеры предполагается использовать для формиро-вания рекомендаций фирмам, принадлежащим одному рынку. А именно, после проведе-ния такого анализа можно указать для конкретной фирмы на то, какие слова приобретают их конкуренты. Нами был использован подход, основанный на ФАП. Ввиду большой вычислительной сложности (при построении всей решетки понятий) мы использовали алгоритм D-miner (Besson et. al., 2004), который порождает формальные понятия, учитывая ограничения на размер объема и содержания (см. таблицу 1).

Таблица 1

Минимальный размер объе-ма понятия

Минимальный размер со-держания

Число формальных понятий

0 0 8 950 740 10 10 3 030 335 15 10 759 963 15 15 150 983 15 20 14 226 20 15 661

В полученном слое решетки мы можем выбрать понятия верхней границы слоя и понятия из нижней, «склеить» их определенным образом и получить кластеры с некоторым числом нулей внутри. При этом понятия нижней границы должны быть потомками выбранного понятия верхнего слоя. Такие кластеры вполне пригодны для формирования рекоменда-ций. Данную идею поясняет рисунок 12.

Рисунок 12

(G,G’)

(M’,M)

(C,D)

(A,B)

(G,G’)B

AC

D

D-miner resultsMinimal extent Minimal intent Number of

size size concepts0 0 8 950 74010 10 3 030 33515 10 759 96315 15 150 98315 20 14 22620 15 661

21 / 58

Page 26: CoClus ICDM Workshop talk

Detecting large market sectors with OA-biclustering

[Ignatov et al., 2010], OA-biclustering, O(|G||M ||I|)

OA-biclustering results

Threshold, ρmin Number of OA-biclusters0 923450.1 897350.2 808930.3 658810.4 456650.5 259210.6 100660.7 20810.8 1650.9 31 0|L| 8 950 740

22 / 58

Page 27: CoClus ICDM Workshop talk

Examples of market sectors

Hosting market

{a�ordable hosting web, business hosting web, cheap hosting, cheap hosting

site web, cheap hosting web, company hosting web, cost hosting low web,

discount hosting web, domain hosting, hosting internet, hosting page web,

hosting service, hosting services web, hosting site web, hosting web}

Hotel market

{ angeles hotel los, atlanta hotel, baltimore hotel, dallas hotel, denver hotel,

diego hotel san, francisco hotel san, hotel houston, hotel miami, hotel new

orleans, hotel new york, hotel orlando, hotel philadelphia, hotel seattle,

hotel vancouver }

23 / 58

Page 28: CoClus ICDM Workshop talk

Recommendations based on association rules

[Szathmary, 2005]

Coron system, Zart algorithm, infortmative basis of association rules

Examples

minsupp=30 minconf=0,9

{florist} → {flower} supp=33 [1.65%]; conf=0.92;

{gift graduation} → {anniversary gift}, supp=41 [2.05%];conf=0.82;

Informative basis statistics

min_supp max_supp min_conf max_conf number of rules

30 86 0,9 1 101 39130 109 0,8 1 144 043

24 / 58

Page 29: CoClus ICDM Workshop talk

Morphology-based metarules

t � advertising phrase, t = {w1, w2, . . . , wn}si = stem(wi) � stem of word wi

stem(t) =⋃istem(wi) � set of stem of words t

KTS = (T, S, ITS) � formal context, where T � set of all

phrases and S � set of all stems of phrases from T , i.e..S =

⋃istem(ti)

tIs denotes that the set of stems of phrase t contains s

25 / 58

Page 30: CoClus ICDM Workshop talk

Morphology-based metarules

A toy example of context KFT for �long distance calling� market

�rm \ phrase call calling calling carrier cheapdistance distance distance distance distancelong long long plan long long

f1 x x xf2 x x xf3 x xf4 x x xf5 x x x x

26 / 58

Page 31: CoClus ICDM Workshop talk

Morphology-based metarules

A toy example of context KTS for �long distance calling� market

phrase \ stem call carrier cheap distanc long plan

call distance long x x xcalling distance long x x x

calling distance long plan x x x xcarrier distance long x x xcheap distance long x x x

27 / 58

Page 32: CoClus ICDM Workshop talk

Morphology-based metarules

Example of metarules

tFT−−→ sITS

i

{last minute vacation} → {last minute travel}Supp= 19 Conf= 0,90

tFT−−→

⋃i

sITSi

{mail order phentermine} →{adipex online order, adipex order, adipex phentermine, . . . ,phentermine prescription, phentermine purchase, phentermine sale}Supp= 19 Conf= 0,95

28 / 58

Page 33: CoClus ICDM Workshop talk

Morphology-based metarules

Example of metarules

tFT−−→ (

⋃i

si)ITS

{distance long phone} →{call distance long phone, carrier distance long phone, . . . ,distance long phone rate, distance long phone service}Supp= 37 Conf= 0,88

t1FT−−→ t2, where t

ITS2 ⊆ tITS

1

{ink jet} → {ink}, Supp= 14 Conf= 0,7

29 / 58

Page 34: CoClus ICDM Workshop talk

Morphology-based metarules

min_conf = 0.5

Average supp and conf for morphological metarules for min_conf = 0, 5

Rule types Average supp Average value of Number ofconf rules

tFT−−→ sITS

i 15 0,64 454

tFT−−→

⋃i

sITSi 15 0,63 75

tFT−−→ (

⋃i

si)ITS 18 0,67 393

tFT−−→ ti, where

tITSi ⊆ tITS 21 0,70 3922

tFT−−→

⋃i

ti, where

tITSi ⊆ tITS 20 0,69 673

30 / 58

Page 35: CoClus ICDM Workshop talk

Experimental validation

Results of cross-validation for association rules

Number of Number of average_conf Number of average_confrules rules with rules with (min_conf=0.5)

sup > 0 min_conf=0.51 147170 73025 0,77 65556 0,842 69028 68709 0,93 68495 0,933 89332 89245 0,95 88952 0,954 107036 93078 0,84 86144 0,905 152455 126275 0,82 113008 0,906 117174 114314 0,89 111739 0,917 131590 129826 0,95 128951 0,968 134728 120987 0,96 106155 0,979 101346 67873 0,72 52715 0,9210 108994 107790 0,93 106155 0,94

means 115885 99112 0,87 92787 0,92

31 / 58

Page 36: CoClus ICDM Workshop talk

Ontology-based metarules

Constructing ontologies

drug

acid vitamin

B D

Operators and rules

Generalization operator gi(.) : T → T and neighborhood operatorn(.) : T → T

A generalization rule t→ gi(t) and a neighborhood rule t→ n(t)

32 / 58

Page 37: CoClus ICDM Workshop talk

Ontological metarules

Examples of metarules for pharmaceutical market.

t→ g1(t)

{d vitamin} → {vitamin }, Supp= 19 Conf= 0,90

t→ n(t)

{b vitamin} → { b complex vitamin, b12 vitamin, c vitamin, dvitamin, discount vitamin, e vitamin, herb vitamin, mineral vitamin,multi vitamin, supplement vitamin} Supp= 18 Conf= 0,7

33 / 58

Page 38: CoClus ICDM Workshop talk

Additional experiments

Experiments settings

Two algorithms: sequential and parallel

C#, Microsoft Visual Studio 2008.

Parallelized by Task Parallel Library èç Microsoft .NET

Framework 4.0.

Intel Pentuim IV Core 2 Duo, 2 GHz, RAM 3Gb

34 / 58

Page 39: CoClus ICDM Workshop talk

Experiments data

Data sets

UCI Machine Learning Repository

Dataset # objects # attributes |I| Density |B(G,M, I)|advertising 2000 3000 92 345 0,015 8 950 740breast-cancer 286 43 2851 0,232 9918�are 1389 49 18057 0,265 28742postoperative 90 26 807 0,345 2378SPECT 267 23 2042 0,333 21550vote 435 18 3856 0,492 10644zoo 101 28 862 0,305 379

35 / 58

Page 40: CoClus ICDM Workshop talk

Experiments

Dependency between the number of biclusters and minimal

density threshold ρmin

Dataset advertising breast-cancer �are postoperative SPECT vote|B(G,M, I)| 8950740 9918 28742 2378 21550 10644

ρ = 0 92345 2851 18057 807 2042 3856ρ = 0, 1 89735 2851 18057 807 2042 3856ρ = 0, 2 80893 2851 18057 807 2042 3856ρ = 0, 3 65881 2849 18050 807 2042 3855ρ = 0, 4 45665 2678 17988 807 2029 3829ρ = 0, 5 25921 1908 17720 725 1753 3527ρ = 0, 6 10066 310 16459 402 835 2575ρ = 0, 7 2081 17 9353 18 262 1458ρ = 0, 8 165 2 1450 2 85 382ρ = 0, 9 3 2 293 2 32 33ρ = 1 0 2 3 2 12 1

36 / 58

Page 41: CoClus ICDM Workshop talk

Experiments

Fraction of generated concepts and the number of biclusters

Dataset advertising breast-cancer �are postoperative SPECT voteReduction 96,9 3,5 1,6 2,9 10,6 2,8

37 / 58

Page 42: CoClus ICDM Workshop talk

Experiments

Execution time of OA-biclustering algorithms for advertising

dataset

38 / 58

Page 43: CoClus ICDM Workshop talk

Triadic Formal Concept Analysis[F.Lehman & R.Wille, 1995]

De�nition 1

Triadic context K = (G,M,B, Y ) consists of set G (objects), M(attributes), B (conditions) and ternary relations

Y ⊆ G×M ×B. Triple (g,m, b) ∈ Y means that the object ghas the attribute m under the condition b.

De�nition 2

(Formal) triconcept of K is a triple (X,Y, Z) which is maximal

w.r.t. its components inclusion, i.e. X ⊆ G, Y ⊆M , Z ⊆ B è

X × Y × Z ⊆ Y

39 / 58

Page 44: CoClus ICDM Workshop talk

Primes and double primes operators

Prime operators of Their double prime1-sets counterparts

m′ = { (g, b) |(g,m, b) ∈ Y } m′′ = { m̃ |(g, b) ∈ m′ and (g, m̃, b) ∈ Y }

g′ = { (m, b) |(g,m, b) ∈ Y } g′′ = { g̃ |(m, b) ∈ g′ and (g̃,m, b) ∈ Y }

b′ = { (g,m) |(g,m, b) ∈ Y } b′′ = { b̃ |(g,m) ∈ b′ and (g,m, b̃) ∈ Y }

40 / 58

Page 45: CoClus ICDM Workshop talk

Box operators[D. Ignatov, S. Kuznetsov et al., 2011]

g� = { gi | (gi, bi) ∈ m′ or (gi,mi) ∈ b′}m� = { mi | (mi, bi) ∈ g′ or (gi,mi) ∈ b′}b� = { bi | (gi, bi) ∈ m′ or (mi, bi) ∈ g′}

41 / 58

Page 46: CoClus ICDM Workshop talk

Tricluster de�nition[D. Ignatov, S. Kuznetsov et al., 2011] accepted to General Systems Journal

Let K = (G,M,B, Y ) be a formal context.

De�nition 1

Tricluster is a triple T = (g�,m�, b�), where (g,m, b) ∈ Y .

De�nition 2

Tricluster density ρ(A,B,C) is de�ned as number of from Y in the

tricluster (A,B,C), i.e ρ(A,B,C) = |I∩A×B×C||A||B||C| .

De�nition 3

Tricluster T = (A,B,C) is called dense if its density exceeds a prede�nedminimal threshold, i.e. ρ(T ) ≥ ρmin.

For a given formal context K = (G,M,B, Y ) denote by T(G,M,B, Y ) the

set of all its (dense) triclusters.

42 / 58

Page 47: CoClus ICDM Workshop talk

Tricluster de�nition[D. Ignatov, S. Kuznetsov et al., 2011] accepted to General Systems Journal

Let K = (G,M,B, Y ) be a formal context.

De�nition 1

Tricluster is a triple T = (g�,m�, b�), where (g,m, b) ∈ Y .

De�nition 2

Tricluster density ρ(A,B,C) is de�ned as number of from Y in the

tricluster (A,B,C), i.e ρ(A,B,C) = |I∩A×B×C||A||B||C| .

De�nition 3

Tricluster T = (A,B,C) is called dense if its density exceeds a prede�nedminimal threshold, i.e. ρ(T ) ≥ ρmin.

For a given formal context K = (G,M,B, Y ) denote by T(G,M,B, Y ) the

set of all its (dense) triclusters.

42 / 58

Page 48: CoClus ICDM Workshop talk

Tricluster de�nition[D. Ignatov, S. Kuznetsov et al., 2011] accepted to General Systems Journal

Let K = (G,M,B, Y ) be a formal context.

De�nition 1

Tricluster is a triple T = (g�,m�, b�), where (g,m, b) ∈ Y .

De�nition 2

Tricluster density ρ(A,B,C) is de�ned as number of from Y in the

tricluster (A,B,C), i.e ρ(A,B,C) = |I∩A×B×C||A||B||C| .

De�nition 3

Tricluster T = (A,B,C) is called dense if its density exceeds a prede�nedminimal threshold, i.e. ρ(T ) ≥ ρmin.

For a given formal context K = (G,M,B, Y ) denote by T(G,M,B, Y ) the

set of all its (dense) triclusters.

42 / 58

Page 49: CoClus ICDM Workshop talk

Tricluster de�nition[D. Ignatov, S. Kuznetsov et al., 2011] accepted to General Systems Journal

Let K = (G,M,B, Y ) be a formal context.

De�nition 1

Tricluster is a triple T = (g�,m�, b�), where (g,m, b) ∈ Y .

De�nition 2

Tricluster density ρ(A,B,C) is de�ned as number of from Y in the

tricluster (A,B,C), i.e ρ(A,B,C) = |I∩A×B×C||A||B||C| .

De�nition 3

Tricluster T = (A,B,C) is called dense if its density exceeds a prede�nedminimal threshold, i.e. ρ(T ) ≥ ρmin.

For a given formal context K = (G,M,B, Y ) denote by T(G,M,B, Y ) the

set of all its (dense) triclusters.

42 / 58

Page 50: CoClus ICDM Workshop talk

Tricluster properties[D. Ignatov, S. Kuznetsov et al., 2011]

Property 1

For every triconcept (A,B,C) of the triadic context K = (G,M,B, Y ) andnon-empty sets A, B è C it holds that ρ(A,B,C) = 1.

Property 2

For every tricluster (A,B,C) of the triadic context K = (G,M,B, Y ) itholds that 0 ≤ ρ(A,B,C) ≤ 1.

43 / 58

Page 51: CoClus ICDM Workshop talk

Useful property[D. Ignatov and S. Kuznetsov et al., 2011]

Proposition

Let K = (G,M,B, Y ) be a triadic context and ρmin = 0. For everytriconcept Tc = (Ac, Bc, Cc) ∈ T(G,M,B, Y ) there exist a triclusterT = (A,B,C) ∈ T(G,M,B, Y ) such that Ac ⊆ A,Bc ⊆ B,Cc ⊆ C.

44 / 58

Page 52: CoClus ICDM Workshop talk

Folksonomy[T. van der Wal, 2004]

De�nition

Quadruple (U, T,R, Y ) is called folksonomy, where U is a set of users, T is aset of tags, R is a set of resources, and Y ⊆ U × T ×R.

Triple (u, t, r) ∈ Y denotes that the user u assigns the tag t to the resource

r.

45 / 58

Page 53: CoClus ICDM Workshop talk

Folksonomy exampleIt is inspired by Bibsonomy (http://bibsonomy.org).

46 / 58

Page 54: CoClus ICDM Workshop talk

Example: triconcepts VS triclusters.

Folksonomy example

t1 t2 t3u1 × ×u2 × × ×u3 × × ×

r1

t1 t2 t3u1 × × ×u2 × ×u3 × × ×

r2

t1 t2 t3u1 × × ×u2 × × ×u3 × ×

r3

T = {(∅, {t1, t2, t3}, {r1, r2, r3}), ({u1}, {t2, t3}, {r1, r2, r3}), . . .({u1, u2, u3}, {t1, t2}, {r3})}|T| = 33 = 27

T = ({u1, u2, u3}, {t1, t2, t3}, {r1, r2, r3}), ρ = 0.89

47 / 58

Page 55: CoClus ICDM Workshop talk

Bibsonomy Data

Bibsonomy.org (ECML PKDD Discovery Challenge, 2008)

1. user (number, no user names available)

2. tag

3. content_id (matches bookmark.content_id or

bibtex.content_id)

4. content_type (1 = bookmark, 2 = bibtex)

5. date

The folksonomy size: |U | = 2 337 users, |T | = 67 464 tags,

|R| = 28 920 resources that related by |Y | = 816 197

triples. The size of the cuboid 4 559 624 602 560 cells.

ρ(U, T,R) ≈ 1, 8 · 10−7

Implementation: Python 2.7.1

System parameters: Pentium Core Duo, 2 GHz, 2 GB

RAM.48 / 58

Page 56: CoClus ICDM Workshop talk

Bibsonomy Data. Top-11 tags

Tag Frequency of assignments

(user, document)

imported 66636

public 15666

system:imported 11294

nn 9147

video 7610

books 6214

software 5021

tools 4423

web2.0 4215

web 4071

blog 3439

49 / 58

Page 57: CoClus ICDM Workshop talk

Experiments

Results for k �rst triples of data set tas with ρmin = 0

k |U | |T | |R| |T| |T| Trias, s TriclEx,s TriclProb,s100 1 47 52 57 1 0.2 0.2 0.21000 1 248 482 368 1 1 1 110000 1 444 5193 733 1 2 46,7 47100000 59 5823 28920 22804 4462 3386 10311 976200000 340 14982 61568 - 19053 > 24 h > 24h 3417

50 / 58

Page 58: CoClus ICDM Workshop talk

Experiments

Density of conceptual triclusters distribution for 200 000 �rst

triples of tas data set with ρmin = 0

Lower bound ρ Upper bound ρ Number of triclusters

0 0,05 186170,05 0,1 1950,1 0,2 1120,2 0,3 400,3 0,4 200,4 0,5 100,5 0,6 80,6 0,7 10,7 0,8 10,8 0,9 00,9 1 49

51 / 58

Page 59: CoClus ICDM Workshop talk

Experimental Comparison of Triclustering algorithms[D. Gnatyshak et al., 2012]

Algorithms

TRIAS, R. J�aschke (2006)

OAC triclustering based on (a) box and (b) primes

operators, D. Ignatov et al. (2011, 2012)

TriBox, B. Mirkin and A. Kramarenko (2011)

SpecTric, D. Ignatov and Z. Sekinaeva (2011)

52 / 58

Page 60: CoClus ICDM Workshop talk

Experiments on arti�cial and real data[D. Gnatyshak et al., 2011]

Table: Contexts for experiments on time, number of triclusters,

density, coverage and diversity.

Context |G| |M | |B| # triples Density

Uniform random context 30 30 30 2660 0,0985

BibSonomy 51 924 2844 3000 2,2385e-005

53 / 58

Page 61: CoClus ICDM Workshop talk

Experiments on noise-tolerance

Tricontext with three dense 10× 10× 10 cuboids on its main

diagonal.

Noise-tolerance vs probability of a triple inversion54 / 58

Page 62: CoClus ICDM Workshop talk

Experiments

Table: Experiments: time, number of triclusters, density, coverage

and diversity.

Algorithm T , ms N ρ C D DG DM DBUniform random context

OAC (�) 407 73 9,88 100 0 0 0 0OAC (′) 312 2659 32,23 100 92,51 60,07 59,80 59,45SpecTric. 277 5 8,74 8,84 100 100 100 100TriBox 6218 1011 74,00 96,02 97,42 66,25 79,53 84,80TRIAS 29367 38356 100 100 99,99 99,93 4,07 3,51

BibSonomy

OAC (�) 19297 398 4,16 100 79,59 67,28 42,83 79,54OAC (′) 13556 1289 9466 100 99,74 88,58 99,51 99,53SpecTric 5906563 2 50 100 100 100 100 100TriBox Time > 24 hoursTRIAS 110554 1305 100 100 99,98 91,70 99,78 99,92

55 / 58

Page 63: CoClus ICDM Workshop talk

Conclusion

Results of FCA-based OA-biclustering show the possibility

of detecting reasonably small number of relatively large

advertisement markets given by companies and advertising

phrases.

Advantage of the morphological metarules approach

consists in the possibility of detecting related advertisement

phrases not given directly in data

Proposed OAC-triclustering techniques are e�ective and

e�cient means of reducing the numbers of patterns in

comparison with triconcepts (execution time and output

size are polynomial).

56 / 58

Page 64: CoClus ICDM Workshop talk

Forthcoming investigations

We are developing a multimodal clustering framework for

relational data � MMC Toolbox.

Conducting more experiments with triclustering for making

recommendations and �nding tricommunities.

Developing multimodal triclustering models and methods

for crowdsourcing platforms.

57 / 58

Page 65: CoClus ICDM Workshop talk

Conclusion

Thank you for your attention!

58 / 58