
Page 1:

Performance Metrics for Graph Mining Tasks


Page 2:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 3:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 4:

Introduction to Performance Metrics

A performance metric measures how well a data mining algorithm performs on a given dataset.

For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is "accuracy."

Performance metrics also help us decide whether one algorithm is better or worse than another.

For example, suppose classification algorithm A classifies 80% of data points correctly and classification algorithm B classifies 90% correctly. We immediately see that B is doing better. There are some intricacies, however, which we discuss in this chapter.


Page 5:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 6:

Supervised Learning Performance Metrics

Metrics that are applied when the ground truth is known (e.g., classification tasks)

Outline:

• 2x2 Confusion Matrix

• Multi-level Confusion Matrix

• Visual Metrics

• Cross-validation


Page 7:

2x2 Confusion Matrix


                            Predicted Class
                            +                -
Actual Class    +           f++              f+-              C = f++ + f+-
                -           f-+              f--              D = f-+ + f--
                            A = f++ + f-+    B = f+- + f--    T = f++ + f-+ + f+- + f--

A 2x2 matrix is used to tabulate the results of a 2-class supervised learning problem; entry (i,j) represents the number of elements with class label i but predicted to have class label j. Here + and - are the two class labels.

Cell labels: f++ = True Positive, f+- = False Negative, f-+ = False Positive, f-- = True Negative.

Page 8:

2x2 Confusion Matrix: Example


Results from a classification algorithm:

Vertex ID    Actual Class    Predicted Class
1            +               +
2            +               +
3            +               +
4            +               +
5            +               -
6            -               +
7            -               +
8            -               -

Corresponding 2x2 matrix for the given table:

                            Predicted Class
                            +        -
Actual Class    +           4        1        C = 5
                -           2        1        D = 3
                            A = 6    B = 2    T = 8

• True Positive = 4  • False Negative = 1  • False Positive = 2  • True Negative = 1

Page 9:

2x2 Confusion Matrix: Performance Metrics

We walk through the different metrics using the example from the previous page.


1. Accuracy is the proportion of correct predictions: (f++ + f--) / T

2. Error rate is the proportion of incorrect predictions: (f+- + f-+) / T

3. Recall is the proportion of "+" data points predicted as "+": f++ / C

4. Precision is the proportion of data points predicted as "+" that are truly "+": f++ / A
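Applied to the example on the previous page (f++ = 4, f+- = 1, f-+ = 2, f-- = 1, T = 8), these definitions give:

Accuracy = (4 + 1)/8 = 0.625, Error rate = (1 + 2)/8 = 0.375, Recall = 4/5 = 0.8, Precision = 4/6 ≈ 0.67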

Page 10:

Multi-level Confusion Matrix

An n x n matrix, where n is the number of classes; entry (i,j) represents the number of elements with class label i but predicted to have class label j.


Page 11:

Multi-level Confusion Matrix: Example


                                Predicted Class                  Marginal Sum of
                                Class 1   Class 2   Class 3      Actual Values
Actual Class       Class 1      2         1         1            4
                   Class 2      1         2         1            4
                   Class 3      1         2         3            6
Marginal Sum of
Predictions                     4         5         5            T = 14

Page 12:

Multi-level Confusion Matrix: Conversion to 2x2

                            Predicted Class
                            Class 1   Class 2   Class 3
Actual Class    Class 1     2         1         1
                Class 2     1         2         1
                Class 3     1         2         3

2x2 matrix specific to Class 1:

                                     Predicted Class
                                     Class 1 (+)   Not Class 1 (-)
Actual Class    Class 1 (+)          2 (f++)       2 (f+-)           C = 4
                Not Class 1 (-)      2 (f-+)       8 (f--)           D = 10
                                     A = 4         B = 10            T = 14

We can now apply all the 2x2 metrics:

Accuracy = 10/14, Error = 4/14, Recall = 2/4, Precision = 2/4

Page 13:

Multi-level Confusion Matrix: Performance Metrics


                            Predicted Class
                            Class 1   Class 2   Class 3
Actual Class    Class 1     2         1         1
                Class 2     1         2         1
                Class 3     1         2         3

1. Critical Success Index or Threat Score for a class L is the ratio of correct predictions for L to the total number of points that either belong to L or are predicted as L, i.e., f_LL / (actual L + predicted L - f_LL).

2. Bias for each class L is the ratio of the total points with class label L to the number of points predicted as L.

Bias helps us understand whether a model is over- or under-predicting a class; a worked example follows below.
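Applying these definitions to the matrix above (the arithmetic is ours, using the threat-score reading given in item 1):

Class 1: CSI = 2 / (4 + 4 - 2) = 1/3;  Bias = 4/4 = 1
Class 2: CSI = 2 / (4 + 5 - 2) = 2/7;  Bias = 4/5 (< 1, the model over-predicts Class 2)
Class 3: CSI = 3 / (6 + 5 - 3) = 3/8;  Bias = 6/5 (> 1, the model under-predicts Class 3)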

Page 14:

Confusion Matrix: R-code


library(PerformanceMetrics)
data(M)
M
     [,1] [,2]
[1,]    4    1
[2,]    2    1
twoCrossConfusionMatrixMetrics(M)

data(MultiLevelM)
MultiLevelM
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    1    2    1
[3,]    1    2    3
multilevelConfusionMatrixMetrics(MultiLevelM)

Page 15:

Visual Metrics

Metrics that are plotted on a graph to obtain a visual picture of the performance of two-class classifiers

[ROC plot: x-axis = False Positive Rate (0 to 1), y-axis = True Positive Rate (0 to 1)]

• (0,1) is the ideal point
• (0,0) corresponds to predicting the -ve class all the time
• (1,1) corresponds to predicting the +ve class all the time
• The diagonal line corresponds to AUC = 0.5 (random performance)

Plot the performance of multiple models on the same ROC plot to decide which one performs best.

Page 16:

Understanding Model Performance based on ROC Plot

[ROC plot annotated by region: x-axis = False Positive Rate, y-axis = True Positive Rate]

• Models that lie in the upper left have good performance. Note: this is where you aim to get the model.
• Models that lie below the AUC = 0.5 diagonal perform worse than random. Note: such models can be negated (their predictions inverted) to move them above the diagonal.
• Models that lie in the lower left are conservative: they will not predict "+" unless there is strong evidence, so they have low false positives but high false negatives.
• Models that lie in the upper right are liberal: they will predict "+" with little evidence, so they have high false positives.

Page 17:

ROC Plot Example

[ROC plot with three models: M1 at (0.1, 0.8), M2 at (0.5, 0.5), M3 at (0.3, 0.5)]

M1's performance occurs furthest in the upper-left direction and hence M1 is considered the best model.

Page 18:

Cross-validation

Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes.

Strategy:

1. Divide the dataset into two non-overlapping subsets

2. One subset is called the "test" set and the other the "training" set

3. Build the model using the "training" set

4. Obtain predictions for the "test" set

5. Use the "test" set predictions to calculate all the performance metrics


Typically, cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.

Page 19:

Types of Cross-validation

• hold-out: A random 1/3rd of the data is used as the test set and the remaining 2/3rd as the training set

• k-fold: Divide the data into k partitions; use one partition as the test set and the remaining k-1 partitions for training

• Leave-one-out: Special case of k-fold where k equals the number of data points, so each test set contains exactly one point


Note: Selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is kept similar to that in the training set. A k-fold sketch follows below.
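As a rough illustration of k-fold cross-validation, here is a minimal sketch in base R plus rpart; the dataset (iris), the classifier choice, and the accuracy metric are our own assumptions, and the stratified selection recommended above is omitted for brevity:

library(rpart)
set.seed(42)
D <- iris                                        # assumed example dataset
k <- 5
folds <- sample(rep(1:k, length.out = nrow(D)))  # assign each row to a partition

accuracy <- numeric(k)
for (i in 1:k) {
  test  <- D[folds == i, ]                       # one partition as the "test" set
  train <- D[folds != i, ]                       # remaining k-1 partitions for "training"
  model <- rpart(Species ~ ., data = train)      # build the model on the training set
  pred  <- predict(model, test, type = "class")  # obtain predictions for the test set
  accuracy[i] <- mean(pred == test$Species)      # one performance metric per fold
}
mean(accuracy)                                   # average accuracy across the k folds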

Page 20:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 21:

Unsupervised Learning Performance Metrics

Metrics that are applied when the ground truth is not always available (e.g., clustering tasks)

Outline:

• Evaluation Using Prior Knowledge

• Evaluation Using Cluster Properties


Page 22:

Evaluation Using Prior Knowledge

One way to test the effectiveness of an unsupervised learning method is to take a dataset D with known class labels, strip the labels, and provide the set as input to the unsupervised learning algorithm U. The resulting clusters are then compared with the knowledge priors to judge the performance of U.

To evaluate performance

1. Contingency Table

2. Ideal and Observed Matrices


Page 23:

Contingency Table


                                Cluster
                                Same Cluster   Different Cluster
Class     Same Class            u11            u10
          Different Class       u01            u00

(A) To fill the table, initialize u11, u10, u01, u00 to 0.
(B) Then, for each pair of points (v, w):

1. if v and w belong to the same class and the same cluster, increment u11

2. if v and w belong to the same class but different clusters, increment u10

3. if v and w belong to different classes but the same cluster, increment u01

4. if v and w belong to different classes and different clusters, increment u00

Page 24:

Contingency Table: Performance Metrics

• Rand Statistic, also called the simple matching coefficient, is a measure in which placing a pair of points with the same class label in the same cluster and placing a pair of points with different class labels in different clusters are given equal importance; i.e., it accounts for both the specificity and sensitivity of the clustering: Rand = (u11 + u00) / (u11 + u10 + u01 + u00)

• Jaccard Coefficient can be utilized when placing a pair of points with the same class label in the same cluster is primarily important: Jaccard = u11 / (u11 + u10 + u01)


Example matrix:

                                Cluster
                                Same Cluster   Different Cluster
Class     Same Class            9              4
          Different Class       3              12
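Plugging the example matrix into the formulas above (the arithmetic is ours):

Rand = (9 + 12) / (9 + 4 + 3 + 12) = 21/28 = 0.75
Jaccard = 9 / (9 + 4 + 3) = 9/16 ≈ 0.56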

Page 25:

Ideal and Observed Matrices

Given that the number of points is T, the ideal matrix is a T x T matrix where each cell (i,j) has a 1 if points i and j belong to the same class and a 0 if they belong to different classes. The observed matrix is a T x T matrix where cell (i,j) has a 1 if points i and j belong to the same cluster and a 0 if they belong to different clusters.

• Mantel Test is a statistical test of the correlation between two matrices of the same rank. The two matrices in this case are symmetric, so it is sufficient to analyze the lower or upper triangle of each matrix. A sketch follows below.
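As an illustrative sketch only (the vegan package and the toy labels are our assumptions; the chapter's own package is not used here), a Mantel test comparing ideal and observed matrices might look like:

library(vegan)
set.seed(1)
labels   <- sample(c("a", "b"), 10, replace = TRUE)      # known class labels
clusters <- sample(1:2, 10, replace = TRUE)              # cluster assignments
ideal    <- as.dist(outer(labels, labels, "!=") * 1)     # 0 = same class, 1 = different
observed <- as.dist(outer(clusters, clusters, "!=") * 1) # 0 = same cluster, 1 = different
mantel(ideal, observed, permutations = 999)              # correlation + permutation p-value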


Page 26:

Evaluation Using Prior Knowledge: R-code

library(PerformanceMetrics)
data(ContingencyTable)
ContingencyTable
     [,1] [,2]
[1,]    9    4
[2,]    3   12
contingencyTableMetrics(ContingencyTable)


Page 27:

Evaluation Using Cluster Properties

In the absence of prior knowledge, we have to rely on information from the clusters themselves to evaluate performance.

1. Cohesion measures how closely objects in the same cluster are related

2. Separation measures how distinct or separated a cluster is from all the other clusters
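The slide's formulas did not survive extraction. A standard formulation consistent with the symbols explained below is sketched here; the pairwise averaging is our assumption:

$$\mathrm{Cohesion}(g_i) = \frac{1}{|g_i|^{2}} \sum_{x \in g_i} \sum_{y \in g_i} \mathrm{proximity}(x, y)$$

$$\mathrm{Separation}(g_i, g_j) = \frac{1}{|g_i|\,|g_j|} \sum_{x \in g_i} \sum_{y \in g_j} \mathrm{proximity}(x, y)$$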

Here, gi refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity).

We want the cohesion to be close to 1 and separation to be close to 0


Page 28:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 29:

Optimizing Metrics

Performance metrics that act as optimization functions for a data mining algorithm

Outline:

• Sum of Squared Errors

• Preserved Variability


Page 30:

Sum of Squared Errors

The sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. This metric takes into consideration the distance between each point in a cluster and its cluster center (centroid or some other chosen representative).

For dj, a point in cluster gi, where mi is the cluster center of gi and W is the total number of clusters, SSE is defined as follows:
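The formula itself was lost in extraction; the standard definition matching these symbols is:

$$\mathrm{SSE} = \sum_{i=1}^{W} \sum_{d_j \in g_i} \mathrm{dist}(d_j, m_i)^2$$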

This value is small when points are close to their cluster center, indicating a good clustering. Similarly, a large SSE indicates a poor clustering. Thus, clustering algorithms aim to minimize SSE.


Page 31:

Preserved Variability

Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. The objective of the dimension reduction technique is to maximize this parameter.

Given that the points are represented in r dimensions and k dimensions are retained (k << r), with eigenvalues λ1 >= λ2 >= ... >= λr-1 >= λr, the preserved variability (PV) is calculated as follows:
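The formula itself was lost in extraction; the standard definition consistent with these symbols is:

$$\mathrm{PV} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{r} \lambda_i}$$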

The value of this parameter depends on the number of dimensions chosen: the more dimensions included, the higher the value. Choosing all r dimensions results in the perfect score of 1.


Page 32:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 33:

Statistical Significance Techniques

• Methods used to assess a p-value for the different performance metrics

Scenario:

– We obtain, say, cohesion = 0.99 for clustering algorithm A. At first look, 0.99 feels like a very good score.

– However, it is possible that the underlying data is structured in such a way that you would get 0.99 no matter how you cluster the data.

– Thus, 0.99 may not be very significant. One way to decide is by using statistical significance estimation.

We discuss the Monte Carlo procedure on the next slide!


Page 34:

Monte Carlo Procedure: Empirical p-value Estimation

The Monte Carlo procedure uses random sampling to assess whether a particular performance metric we obtain could have been attained at random.

For example, if we obtain a cohesion score of 0.99 for a cluster of size 5, we would be inclined to think it is a very cohesive cluster. However, this value could have resulted from the nature of the data and not from the algorithm. To test the significance of this 0.99 value we:

1. Sample N (usually 1000) random sets of size 5 from the dataset

2. Recalculate the cohesion for each of the N random sets

3. Count R: the number of random sets with a value >= 0.99 (the original score of the cluster)

4. The empirical p-value for the cluster of size 5 with the 0.99 score is given by R/N

5. We apply a cutoff, say 0.05, to decide whether 0.99 is significant

Steps 1-4 are the Monte Carlo method for p-value estimation; a sketch in R follows below.
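A minimal sketch of the procedure, assuming a toy dataset and a toy cohesion() scoring function (neither comes from the PerformanceMetrics package):

set.seed(7)
pts <- matrix(rnorm(200), ncol = 2)           # assumed dataset of 100 points
cohesion <- function(rows) {                  # toy cohesion: mean pairwise similarity
  d <- as.matrix(dist(pts[rows, ]))
  mean(1 / (1 + d[upper.tri(d)]))
}
observed <- 0.99                              # score of the cluster being tested
N <- 1000
random_scores <- replicate(N, cohesion(sample(nrow(pts), 5)))  # steps 1 and 2
R <- sum(random_scores >= observed)           # step 3
p_value <- R / N                              # step 4: empirical p-value
p_value < 0.05                                # step 5: significant at the 0.05 cutoff?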


Page 35:

Outline

• Introduction to Performance Metrics

• Supervised Learning Performance Metrics

• Unsupervised Learning Performance Metrics

• Optimizing Metrics

• Statistical Significance Techniques

• Model Comparison


Page 36:

Model Comparison

Metrics that compare the performance of different algorithms

Scenario:

1) Model 1 provides an accuracy of 70% and Model 2 provides an accuracy of 75%

2) At first look Model 2 seems better; however, it could be that Model 1 predicts Class 1 better than Class 2

3) Suppose Class 1 is in fact more important than Class 2 for our problem

4) We can use model comparison methods to take this notion of "importance" into consideration when we pick one model over another

Cost-based analysis is an important model comparison method discussed in the next few slides.


Page 37:

Cost-based Analysis

In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice versa, the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix.


Cost Matrix

                            Predicted Class
                            +      -
Actual Class    +           c11    c10
                -           c01    c00

Each cost is associated with the corresponding confusion matrix count: c11 with f11 (or u11), c10 with f10 (or u10), c01 with f01 (or u01), and c00 with f00 (or u00).

Page 38:

Cost-based Analysis: Cost of a Model

The cost and confusion matrices for Model M are given below

Cost of Model M is given as:
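The formula was lost in extraction; the standard cost computation consistent with these matrices (and with the worked numbers on the next page) is:

$$\mathrm{Cost}(M) = c_{11} f_{11} + c_{10} f_{10} + c_{01} f_{01} + c_{00} f_{00}$$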


Cost Matrix

                            Predicted Class
                            +      -
Actual Class    +           c11    c10
                -           c01    c00

Confusion Matrix

                            Predicted Class
                            +      -
Actual Class    +           f11    f10
                -           f01    f00

Page 39:

Cost-based Analysis: Comparing Two Models

This analysis is typically used to select one model when we have more than one choice, obtained by using different algorithms or different parameters to the learning algorithms.


Cost Matrix

                            Predicted Class
                            +      -
Actual Class    +           -20    100
                -           45     -10

Confusion Matrix of Mx

                            Predicted Class
                            +      -
Actual Class    +           4      1
                -           2      1

Confusion Matrix of My

                            Predicted Class
                            +      -
Actual Class    +           3      2
                -           2      1

Cost of Mx: 100, Cost of My: 220

C(Mx) < C(My)

Purely based on the cost model, Mx is a better model. The base-R check below reproduces these numbers.
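A quick check of the costs in base R (the matrices simply mirror the tables above; this is not the PerformanceMetrics costAnalysis() output):

CostMatrix <- matrix(c(-20, 45, 100, -10), nrow = 2)  # column-major: c11, c01, c10, c00
Mx <- matrix(c(4, 2, 1, 1), nrow = 2)                 # confusion matrix of Mx
My <- matrix(c(3, 2, 2, 1), nrow = 2)                 # confusion matrix of My
sum(CostMatrix * Mx)                                  # cost of Mx: 100
sum(CostMatrix * My)                                  # cost of My: 220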

Page 40:

Cost-based Analysis: R-code

library(PerformanceMetrics)
data(Mx)
data(My)
data(CostMatrix)
Mx
     [,1] [,2]
[1,]    4    1
[2,]    2    1
My
     [,1] [,2]
[1,]    3    2
[2,]    2    1
costAnalysis(Mx, CostMatrix)
costAnalysis(My, CostMatrix)
