
CSCE878 – FCA Based Approaches for Classification Problems

CSCE878 Project Report MACHINE LEARNING December 13, 2014 Suraj Ketan Samal


FCA-Based Approaches for Classification Problems

Abstract: Formal concept analysis (FCA) is a mathematical theory, based on lattice and order theory, used for data analysis and knowledge representation. It formalizes conceptual thinking and has applications in areas such as machine learning, data mining, knowledge management, the semantic web, software engineering, chemistry and biology. One of the most basic problems in machine learning is the classification of data into categories (or labels). This problem has been studied extensively and is used in a variety of applications such as email filtering, text recognition and clustering. Classification problems traditionally fall into two main categories, supervised and unsupervised learning. In this project, we first propose a “crude” (naïve) approach to using FCA for classification problems and then study various existing approaches in the literature. Further, we analyze and compare them with each other on common grounds and discuss future directions and open issues.

1. Introduction:

Machine learning can be described as the scientific study of constructing computer programs that learn from data and improve with experience. FCA was originally used to exploit the relationship between a set of objects and their properties, thereby learning predictable patterns from available data to build a knowledge base of the domain. Combining FCA with various machine learning models and domain expertise has yielded many interesting results.

One of the earliest uses of FCA in machine learning is CHARADE [1]. It constructs classification rules based on k-DNF expressions built from the generated lattice and claims better performance than the basic decision tree algorithm (ID3). Rulearner [7] formally defines and employs an algorithm that generates rules by recursively finding the minimum set of attributes that classifies the maximum set of instances of a specific category. In [4, 5, 6], the authors describe the use of a lattice to generate a minimal set of factors that can replace the attributes and help in better classification of data. The notion of iceberg concept lattices introduced in [2, 3] presents novel ways to apply FCA to large datasets based on approximation and noise-filtering techniques.

The rest of this document is organized as follows. Section 2 gives a brief history and background of Formal Concept Analysis (FCA) and its applications in various areas. Section 3 presents the FCA theory with an example. Section 4 describes our crude (“naive”) approach to using FCA for classification problems, along with the setup and datasets used for our experiments. Section 5 reviews important existing approaches in the literature. Section 6 discusses future directions based on the analysis of the existing approaches, and Section 7 concludes.



2. Background of FCA

Formal Concept Analysis (FCA) was formally defined by Rudolf Wille [14] in 1982. He proposed an elegant approach based on lattice theory to organize concepts in hierarchies and derive meaningful relationships from them. Much work has since been done to generalize FCA using other theories such as Rough Set Theory, Fuzzy Set Theory, Monotone Concepts and AFS Algebra. A detailed survey of FCA variants and their applications can be found in [15].

Figure 1 – Basic workflow of use of Formal Concept Analysis (FCA) in a specific domain

FCA has been used in a variety of applications since its inception. A basic workflow for using FCA is shown in Figure 1. FCA exploits the relationship between the set of objects and their attributes in a particular domain and represents it as a binary matrix known as a formal context. Applying FCA theory to the context yields a set of formal concepts, each a pairing of objects and attributes that represents knowledge in the domain. These concepts can then be ordered into a lattice diagram, from which rules and relationships can be extracted for use in applications within the domain.



Figure 2 – Application of FCA to categorize a set of papers in software engineering domain[10].

Figure 2 shows a lattice diagram used to categorize a set of papers by the stages of software engineering they address. Organizing the papers this way makes it easy to identify the set of papers that belong to a particular category. Figure 3 shows the use of FCA in the medical domain to identify which symptoms are associated with certain diseases. Figure 4 shows an example from bioinformatics where FCA can be used to predict which genes relate to different behaviors in various organisms.

Figure 3 & 4 – Application of FCA to medicine (to classify a set of symptoms related to a disease) and bioinformatics (to predict which genes are responsible for certain behavior)[8].

Figure 5 shows an example of finding clusters in the social networking domain. The members and their participation data, in the form of cliques, are organized into a concept lattice and then interpreted to identify social communities (clusters). The paper [9] also examines the type and strength of the role each member plays within the identified communities.



Figure 5 – Application of FCA to identify communities in social networks [9].

3. Formal Concept Analysis (FCA) Theory

A domain consists of various objects and their attributes. Each object is characterized by the set of attributes it possesses, and this relationship can be organized into a matrix known as a formal context. A formal context is defined as a triple (G, M, I) where G is a set of objects, M is a set of attributes, and I is a binary relation between G and M, with gIm indicating that the object g has the attribute m. There is a Galois connection between G and M defined by

$$A' = \{\, m \in M \mid gIm \text{ for all } g \in A \,\} \quad \text{for } A \subseteq G$$

and

$$B' = \{\, g \in G \mid gIm \text{ for all } m \in B \,\} \quad \text{for } B \subseteq M.$$

A formal concept is defined as a pair (A, B) with $A \subseteq G$, $B \subseteq M$, $A' = B$ and $B' = A$, where A is called the extent and B the intent of the concept. The set of all concepts derived from the context can be arranged in a hierarchy using the subconcept-superconcept relation defined by

$$(A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2 \text{ and } B_1 \supseteq B_2$$

for concepts $(A_1, B_1)$ and $(A_2, B_2)$ of (G, M, I). If L(G, M, I) is the set of all concepts of the context (G, M, I), then $(L(G, M, I), \le)$ forms a complete lattice with a supremum (the top concept, whose extent contains all objects and whose intent is typically empty) and an infimum (the bottom concept, whose intent contains all attributes and whose extent is typically empty). Further details can be found in Wille’s seminal paper [14] or in his classical book [16].
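The two derivation operators and the definition of a formal concept can be illustrated with a minimal Python sketch. The three-object context below is hypothetical (it is not the context of Figure 6), and the enumeration closes every attribute subset, which is exponential and only suitable for toy examples.

```python
from itertools import combinations

# Hypothetical toy context (G, M, I); not taken from the report's figures.
OBJECTS = ["o1", "o2", "o3"]
ATTRIBUTES = ["a", "b", "c"]
INCIDENCE = {("o1", "a"), ("o1", "b"), ("o2", "b"), ("o2", "c"), ("o3", "b")}


def intent_of(objs):
    """A': the attributes shared by every object in objs."""
    return frozenset(m for m in ATTRIBUTES if all((g, m) in INCIDENCE for g in objs))


def extent_of(attrs):
    """B': the objects that possess every attribute in attrs."""
    return frozenset(g for g in OBJECTS if all((g, m) in INCIDENCE for m in attrs))


def all_concepts():
    """Enumerate concepts (extent, intent) by closing every attribute subset."""
    concepts = set()
    for size in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, size):
            extent = extent_of(attrs)
            concepts.add((extent, intent_of(extent)))
    return concepts


def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) iff A1 is a subset of A2 (equivalently B1 contains B2)."""
    return c1[0] <= c2[0] and c1[1] >= c2[1]


for extent, intent in sorted(all_concepts(), key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

In this toy context the top concept has all three objects as its extent and the bottom concept has all three attributes as its intent, in line with the ordering defined above.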

Several algorithms exist for mining the formal concepts of a context. The number of concepts can grow exponentially with the size of the context, so efficient algorithms are polynomial per generated concept rather than polynomial overall; a survey and comparison of these algorithms can be found in [12]. Many simple, open-source applications (e.g., LatticeMiner, FCAStone, ConExp) generate the lattice and concepts given a set of objects and attributes as input. For this project, we use the ConExp tool [17] to generate the lattice diagrams.

3.2 An intuitive example

Figure 6 – A formal context example taken from [4].

Figure 6a shows a small example of a formal context defined by a set of animals representing objects and a set of properties exhibited by them representing attributes of the domain. Note the attributes here are multi-valued and hence are scaled (modified) to have only binary values as shown in Figure 6b.
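The scaling step can be sketched as follows: each (attribute, value) pair of a multi-valued context becomes one binary attribute. This is a minimal illustration of nominal scaling under assumed data; the animal rows are hypothetical, and only the "bt-"/"gb-" naming style is borrowed from the rule labels used later in this report.

```python
def nominal_scale(rows):
    """Map {object: {attribute: value}} to {object: set of 'attribute-value' names}."""
    return {
        obj: {f"{attr}-{value}" for attr, value in values.items()}
        for obj, values in rows.items()
    }


# Hypothetical multi-valued rows (not the data of Figure 6a).
animals = {
    "dog": {"bt": "warm", "gb": "yes"},
    "frog": {"bt": "cold", "gb": "no"},
}

scaled = nominal_scale(animals)
binary_attributes = sorted(set().union(*scaled.values()))
print(binary_attributes)
```

Two multi-valued attributes with two values each yield four binary attributes here, which is why the attribute count grows after scaling.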

Figure 7 – Lattice diagram of the formal context example described in Figure 6

The corresponding lattice diagram, consisting of 21 concepts, can be seen in Figure 7. The top and bottom nodes are trivial concepts; the other concepts are identified by the object and attribute labels attached to them or inherited along the edges connecting the nodes. Note that the edges are drawn without arrowheads: as in any lattice (Hasse) diagram, the direction is implied by vertical position, with the lower node of an edge being a subconcept of the upper node.



4. CrudeFCA - A naive classification approach

This section discusses a naïve approach for generating rules by using a crude technique to mine relationships from the FCA lattice and solve the machine learning classification problem.

4.1.1 Description

For simplicity, we use the example shown in Figure 6 to describe our approach. Figure 6 describes a simple problem of classifying five instances (two positive and three negative) into two categories (mammal or not-mammal). The corresponding lattice is shown in Figure 7. We propose the algorithm described in Figure 8 to extract the classification rules:

Figure 8 – The proposed CrudeFCA algorithm

The application of the “crude” algorithm results in classifying the instances as shown below in Figure 9:

Figure 9 – Application of CrudeFCA to generate classification rules.

The generated classification rules are:

Rule 1: {bt-cold} (-)
Rule 2: {gb-no} (-)
Rule 3: {bt-warm, gb-yes} (+)
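One way to apply the three rules to a new instance is sketched below. The report does not state how rules are matched, so the first-match semantics and the `default` fallback are assumptions made for illustration.

```python
# The three rules listed above: (required attributes, predicted label).
RULES = [
    (frozenset({"bt-cold"}), "-"),
    (frozenset({"gb-no"}), "-"),
    (frozenset({"bt-warm", "gb-yes"}), "+"),
]


def classify(instance_attrs, rules=RULES, default=None):
    """Return the label of the first rule whose attributes all occur in the instance.

    First-match ordering and the default value are assumptions of this sketch.
    """
    for condition, label in rules:
        if condition <= instance_attrs:
            return label
    return default


print(classify({"bt-warm", "gb-yes"}))
print(classify({"bt-warm", "gb-no"}))
```

An instance that matches no rule is left unclassified (`None`) in this sketch; a majority-class fallback would be an equally reasonable choice.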

4.1.2 Application of CrudeFCA to a few standard datasets

Two standard datasets from the UCI (University of California, Irvine) repository were chosen to test the CrudeFCA algorithm. The properties of the datasets are described in Table 1 below:

Dataset Name                                    No of Classes   Test Instances   Training Instances   No of attributes   No of attributes (after scaling)
Congressional Voting Records (house-votes-84)   2               130              305                  16                 48
Monks (monks-1)                                 2               62               124                  6                  17
Monks (monks-2)                                 2               218              169                  6                  17
Monks (monks-3)                                 2               223              122                  6                  17

Table 1 – Properties of UCI datasets chosen for testing CrudeFCA Algorithm

For the congressmen dataset, the instances were randomly partitioned into a training set (70%) and a test set (30%).

4.1.3 Comparison with ID3 (Decision Tree) Algorithm

We compared our CrudeFCA implementation with an ID3 implementation, which builds a decision tree by recursively selecting the best attribute from the available attributes. The best attribute for ID3 was the one giving the best split of the training instances, measured by entropy (information gain). The misclassification rates recorded for both algorithms can be seen in Table 2:

Dataset        ID3 (%)              CrudeFCA (%)
               Training   Test      Training   Test
congressmen    2.96       4.98      0.23       0.00
monks-1        26.61      23.08     0.00       0.00
monks-2        37.87      26.57     0.00       33.67
monks-3        22.13      22.07     0.00       26.29

Table 2 – Misclassification rates (%) of CrudeFCA and ID3 implementations
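For reference, the entropy-based attribute selection used by ID3 can be sketched as below. This is a generic illustration of entropy and information gain, not the ID3 implementation used in the experiments; the function names and the tiny example are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the instances on `attribute`."""
    total = len(labels)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical instances: attribute "x" separates the classes, "y" does not.
rows = [{"x": "a", "y": "u"}, {"x": "a", "y": "v"}, {"x": "b", "y": "u"}, {"x": "b", "y": "v"}]
labels = ["+", "+", "-", "-"]
best = max(["x", "y"], key=lambda attr: information_gain(rows, labels, attr))
print(best)
```

ID3 picks the attribute with the highest gain at each node and recurses on the resulting partitions.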



An 8-fold cross-validation test was also run to compare CrudeFCA with ID3. The results are shown in Table 3.

Dataset            Confidence
                   90%          95%
congressmen        can't say    can't say
monks (combined)   ID3          ID3

Table 3 – Results of 8-fold cross-validation tests on CrudeFCA and ID3 implementations
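The fold construction behind such a test can be sketched as follows. This only shows how instances might be divided into 8 folds; the shuffling seed and the helper name are hypothetical, and the statistical comparison at the 90% and 95% confidence levels is not reproduced here.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 435 is the combined instance count of the congressmen dataset (130 + 305).
splits = list(k_fold_indices(435, 8))
print(len(splits), sorted({len(test) for _, test in splits}))
```

Each instance appears in exactly one test fold, so every instance is used for testing once and for training seven times.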

Takeaways: CrudeFCA is close in performance to ID3. In fact, on the congressmen and monks-1 datasets it performed better than ID3. However, its performance degraded on the other datasets (monks-2 and monks-3), which contain noise. CrudeFCA over-fitted on these datasets (note that the training error is zero while the test error is high). This behavior was expected, as CrudeFCA simply searches by brute force for the best combination of attributes to classify the training instances and is therefore unable to handle noise.

4.1.4 A possible enhancement to CrudeFCA

The performance of “CrudeFCA” can possibly be improved by using a threshold (t) to stop the exploration early and avoid over-fitting on the training data. Figure 10 shows how the generated concept lattice can be divided into levels based on the distance from the top concept (the supremum). Choosing a threshold (t) as the maximum level (depth) to be explored should result in fewer rules and hence reduce over-fitting on the training data. As can be seen in Figure 10, reducing the threshold (t) to 1 removes the third rule {bt-warm, gb-yes}.

Figure 10 – Lattice diagram showing division of formal concepts at various levels.
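The effect of the threshold can be illustrated with a rough sketch. It is not the algorithm of Figure 8, which is not reproduced in this text: as a stand-in for lattice levels, it assumes that depth equals the number of attributes in a rule, and it keeps a rule only if all matching training instances share one label and at least one of them is not yet covered. The five training instances are hypothetical, chosen to have two positive and three negative examples.

```python
from itertools import combinations


def extract_rules(training, max_depth):
    """Depth-limited rule search over attribute combinations (illustrative only).

    training: list of (attribute_set, label); returns a list of (condition, label).
    """
    attributes = sorted(set().union(*(attrs for attrs, _ in training)))
    uncovered = set(range(len(training)))
    rules = []
    for depth in range(1, max_depth + 1):
        for combo in combinations(attributes, depth):
            condition = frozenset(combo)
            matched = {i for i, (attrs, _) in enumerate(training) if condition <= attrs}
            labels = {training[i][1] for i in matched}
            # Keep only pure conditions that cover at least one new instance.
            if len(labels) == 1 and matched & uncovered:
                rules.append((condition, labels.pop()))
                uncovered -= matched
    return rules


# Hypothetical training data (2 positive, 3 negative); not the data of Figure 6.
training = [
    ({"bt-warm", "gb-yes"}, "+"),
    ({"bt-warm", "gb-yes"}, "+"),
    ({"bt-cold", "gb-no"}, "-"),
    ({"bt-warm", "gb-no"}, "-"),
    ({"bt-cold", "gb-yes"}, "-"),
]

print(extract_rules(training, max_depth=1))
print(extract_rules(training, max_depth=2))
```

On this data a depth of 2 yields three rules of the same form as those listed in Section 4.1.1, while a depth of 1 drops the two-attribute rule, leaving the positive instances unclassified. That is the trade-off the threshold introduces: fewer, more general rules at the cost of coverage.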

This modified algorithm has yet to be tested on standard datasets. We plan to test it and include the results in the next version of this document.



5. Existing Literature Review

In this section we review and briefly discuss some of the important existing approaches in the literature that apply FCA to classification problems in machine learning. For simplicity and ease of understanding, we use the same example discussed in the previous sections to explain these approaches.

5.2 Charade (Moulinier, Ganascia 1996)

The Charade system [1] proposed a new approach to text categorization. It generates classification rules based on k-DNF expressions of attributes extracted from the concept lattice.

Figure 11 – Charade: Exploration of classification rules using k-DNF expressions

A k-DNF expression is a disjunction of conjunctions, each containing at most k attributes (literals), that acts as a rule to classify some instances. The authors divide the lattice into a description space (or attribute space) and an example space (or object/instance space), as shown in Figure 11, and generate a list of k-DNF expressions by exploring the lattice from the top until all of the instances are classified. They also define a redundancy parameter (λ) for each concept to measure the degree of overlap of attributes between concepts. The redundancy parameter (λ) is then varied to observe its effect on the performance of the algorithm.
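In the standard definition, a k-DNF expression is a disjunction of conjunctive terms with at most k literals each. The sketch below shows how such an expression can be represented and evaluated on a binary instance; it handles positive literals only, and the expression and the attribute "has-fur" are hypothetical rather than taken from Charade.

```python
def is_k_dnf(terms, k):
    """True if every conjunctive term has at most k literals."""
    return all(len(term) <= k for term in terms)


def satisfies_dnf(instance_attrs, terms):
    """True if at least one term is fully contained in the instance's attributes."""
    return any(term <= instance_attrs for term in terms)


# Hypothetical 2-DNF: (bt-warm AND gb-yes) OR (has-fur)
expression = [frozenset({"bt-warm", "gb-yes"}), frozenset({"has-fur"})]

print(is_k_dnf(expression, 2))
print(satisfies_dnf({"bt-warm", "gb-yes"}, expression))
print(satisfies_dnf({"bt-warm", "gb-no"}, expression))
```

Representing each term as a set makes the check a subset test, which mirrors how a concept's intent is matched against an instance.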

The approach is applied to the Reuters dataset for text categorization. Results indicate that the approach performs better than decision tree (ID3) and Naïve Bayes implementations. Varying the redundancy parameter (λ) improved performance up to an optimum point, beyond which performance degraded again.



5.3 Rulearner (Sahami 1995)

Rulearner [7] generates classification rules from the lattice by defining certain parameters for the concept nodes. The author defines the Upward Closure (UC) of a node as the set of nodes related to it in the upward direction, including the node itself. For example, in Figure 12, UC(A) = {A, B, C, D}. Similarly, the Downward Closure (DC) of a node is the set of nodes related to it in the downward direction. In Figure 12, DC(P) = {P, Q, R}. A further parameter, Cover(c), is defined as the number of objects (or instances) classified by node c. The approach repeatedly finds the node (concept) with maximum cover and non-mixed labeling. Once such a node is found, the other nodes in its Downward Closure (DC) are deactivated, and the algorithm continues until no objects are left unclassified.

Figure 12 – Exploration of classification rules using Rulearner Approach
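The two closures are simple reachability computations on the lattice diagram. The sketch below assumes a hypothetical edge structure, chosen only so that it reproduces the UC(A) and DC(P) values quoted above; the actual shape of Figure 12 is not available in this text.

```python
def closure(start, neighbours):
    """All nodes reachable from `start` (inclusive) by following `neighbours`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invert(edges):
    """Turn a node -> parents map into a node -> children map."""
    inverted = {node: [] for node in edges}
    for node, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(node)
    return inverted


# Hypothetical diagram: each node maps to the nodes directly above it.
PARENTS = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [],
           "Q": ["P"], "R": ["P"], "P": []}
CHILDREN = invert(PARENTS)

print(sorted(closure("A", PARENTS)))    # upward closure of A
print(sorted(closure("P", CHILDREN)))   # downward closure of P
```

Deactivating the downward closure of a selected node then amounts to removing `closure(node, CHILDREN)` from the set of candidate nodes.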

This algorithm is evaluated on the monks and breast-cancer datasets and compared with existing C4.5 and CN2 implementations. Results appear to favor Rulearner, especially on noisy datasets (e.g., monks-3).

5.4 Boolean Factor Analysis (Keprt 2004, Belohlavek 2006)

Boolean Factor Analysis (also known as Binary Factor Analysis) [5, 6] is a method to decompose an existing matrix X of size $p \times n$ into two matrices, F of size $p \times m$ and A of size $m \times n$ (Figure 13).



Figure 13 – Decomposition of set of attributes into a set of factors using Boolean Factor Analysis

The basic idea behind this approach is to replace the set of p attributes with a set of m factors such that $m \ll p$. Keprt [5] proposed a simple and elegant method to find an optimal m based on the concept lattice.
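In Boolean factor analysis the decomposition uses the Boolean matrix product, in which multiplication is logical AND and addition is logical OR. The sketch below computes that product for two small hypothetical matrices; it shows how a product of F and A reconstructs X, not how the factors are found.

```python
def boolean_product(F, A):
    """Boolean matrix product: entry (i, j) is OR over l of (F[i][l] AND A[l][j])."""
    inner = len(A)
    return [
        [int(any(F[i][l] and A[l][j] for l in range(inner))) for j in range(len(A[0]))]
        for i in range(len(F))
    ]


# Hypothetical factor matrices with two factors.
F = [[1, 0],
     [0, 1],
     [1, 1]]
A = [[1, 1, 0, 0],
     [0, 0, 1, 1]]

X = boolean_product(F, A)
for row in X:
    print(row)
```

Each factor corresponds to one column of F and one row of A; when formal concepts are used as factors [6], these are the characteristic vectors of a concept's extent and intent.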

Figure 14 – Identification of minimal set of factors using Boolean Factor Analysis (BFA)

Figure 15 – Reduced attribute set after applying BFA to the original attributes set

This is done by exploring the concept lattice in a top-down fashion and identifying the concepts that have a maximal set of attributes and classify a maximal set of objects. Figure 14 highlights the concepts identified by this approach. The associated factors are then generated from the concepts and used as a replacement for the attributes. Figure 15 shows the resulting reduced matrix, where the attribute space of size 8 has been reduced to 7. Once the factors are determined, classification can be done as with a decision tree classifier [4].



This approach is evaluated on a variety of datasets (breast-cancer, mushroom, tic-tac-toe, congressmen and zoo). Results show that the classification accuracy is marginally better than that of the ID3 and C4.5 classifiers.

5.5 Iceberg Lattices (Stumme 2002, Andrews 2010)

The concept of the iceberg lattice, based on Frequent Concept Analysis (a variation of Formal Concept Analysis), was proposed by Stumme [2] in 2002. The authors describe a novel algorithm, TITANIC, for mining relationships in huge datasets.

Figure 16 – Iceberg Lattice showing every node having an additional “minsupp” parameter.

Figure 16 shows an iceberg lattice for the same example discussed earlier. Each node is annotated with a parameter “minsupp”, which measures the number (or proportion) of instances covered by the node. A threshold (t) is then fixed and all nodes with minsupp < t are removed from the lattice. The resulting lattice is used instead of the original lattice, and various algorithms (as in [3]) can be applied to mine the classification rules. Choosing a threshold (t) of 3 reduces the lattice in Figure 16 to the simpler one in Figure 17.



Figure 17 – Reduced iceberg lattice of the original lattice shown in Figure 16
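The pruning step can be sketched as a filter over the list of concepts. The sketch assumes that support is measured as an absolute count of objects in the extent (consistent with a threshold of 3); a relative support would divide by the total number of objects. The concepts listed are hypothetical and do not correspond to Figure 16.

```python
def iceberg(concepts, min_support):
    """Keep the concepts whose extent contains at least `min_support` objects."""
    return [(extent, intent) for extent, intent in concepts if len(extent) >= min_support]


# Hypothetical concepts as (extent, intent) pairs.
concepts = [
    (frozenset({"o1", "o2", "o3", "o4", "o5"}), frozenset()),
    (frozenset({"o1", "o2", "o4"}), frozenset({"bt-warm"})),
    (frozenset({"o1", "o2"}), frozenset({"bt-warm", "gb-yes"})),
    (frozenset(), frozenset({"bt-warm", "bt-cold", "gb-yes", "gb-no"})),
]

for extent, intent in iceberg(concepts, min_support=3):
    print(len(extent), sorted(intent))
```

Because the extent can only shrink when moving down the lattice, the surviving concepts always form the upper part of the lattice, which is where the name "iceberg" comes from.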

This approach works especially well for huge datasets, where the number of generated concepts can be very high and an approximation is needed. The authors of [3] apply this approach to the mushroom dataset and show its usefulness in removing noise, resulting in a readable lattice for mining classification rules.

6. Future Directions

Following are some of the future directions that can be seen as a result of this study:

• Implementation of the existing approaches in literature and comparing them with each other by running on standard machine learning datasets.

• Derivation of theoretical results that can result in generation of better classification algorithms.

• Exploration of deeper FCA theoretical concepts (e.g. temporal decompositions, block relations) and of FCA variants based on rough sets, fuzzy sets, monotone concepts and AFS Algebra.

7. Conclusion

Formal Concept Analysis (FCA) has not yet been used extensively in machine learning, and there is considerable potential for its use. FCA and its variants can produce interesting results and may lead to the design of efficient classifiers.

8. References:

[1] Applying an Existing Machine Learning Algorithm to Text Categorization (Moulinier, Ganascia 1996).

[2] Computing iceberg concept lattices with TITANIC (Stumme 2002).

[3] Analysis of large data sets using formal concept lattices (Andrews 2010).

[4] Preprocessing input data for machine learning by FCA (Outrata 2010).

[5] Using Blind Search and Formal Concepts for Binary Factor Analysis (Keprt 2004).

[6] On Boolean factor analysis with formal concepts as factors (Belohlavek 2006).

[7] RuLearner: Learning Classification Rules Using Lattices (Sahami 1995).

[8] Two FCA-Based Methods for Mining Gene Expression Data (Kaytoue 2009).

[9] Cliques, Galois lattices, and the structure of human social groups (Freeman 1996).

[10] A Survey of Formal Concept Analysis Support for Software Engineering Activities (Tilley 2005).

[11] GALOIS: A lattice conceptual clustering system and its application to browsing retrieval (Carpineto 1992).

[12] Comparing Performance of Algorithms for Generating Concept Lattices (Kuznetsov 2001).

[13] Machine Learning on the Basis of Formal Concept Analysis (Kuznetsov 2001).

[14] Restructuring lattice theory: an approach based on hierarchies of concepts (Wille 1982).

[15] Formal Concept Analysis: Foundations and Applications (Ganter, Stumme, Wille 2005).

[16] Formal Concept Analysis: Mathematical Foundations (Ganter, Wille 1999).

[17] ConExp Tool: http://conexp.sourceforge.net/
